REVISION HISTORY


Revision  1 

•  Incorporates  performance  data  collected  from  benchmark  experiments 

•  Uses  NUWC’s  mapping  for  the  first  stage  of  beamforming 

•  Includes  analysis  of  new  4K  FFT  algorithm  mode 

NOTE:  This  report  replaces  all  earlier  versions  of  Project  Report  CBASS-1,  dated  25  September  2001,  in 
its  entirety. 


EXECUTIVE  SUMMARY 


This project report details the MIT Lincoln Laboratory computational performance analysis of the
CBASS BSAR. We describe:

•  the  BSAR  algorithm,  in  both  the  original  mode  and  the  new,  default  mode 

•  the  algorithm’s  computational  workload  for  both  modes 

•  the  12-Hammerhead  DR  on  which  the  BSAR  algorithm  is  executed 

•  a  proposed  mapping  of  the  algorithm  onto  the  DR 

•  an  estimated  execution  time  for  the  algorithm  in  both  modes  using  the  proposed  mapping 

•  a  memory  usage  analysis 

Since  the  publication  of  the  original  version  of  this  document,  several  key  developments  in  the 
CBASS  program  have  occurred  that  significantly  affect  the  BSAR  design.  These  developments,  which  will 
be  detailed  in  this  first  revision  of  the  report,  are: 

•  MIT Lincoln Laboratory has completed its kernel benchmark experiments. These benchmarks,
which measured the computation and communication performance of a commercially
available Hammerhead board very similar to the DR, allow us to use actual measured
performance instead of estimates in our execution time analysis.

•  These  benchmark  results  revealed  that  the  throughput  of  the  Hammerhead  on  small  matrix 
multiplications  is  radically  lower  than  we  originally  estimated.  This  lower  computation 
throughput made the mapping proposed in the original document for the first step of
beamforming completely ineffective. NUWC has since proposed a mapping for this step that
allows  us  to  hide  the  computation  time  behind  memory  access.  The  NUWC  mapping,  which 
allows  for  the  use  of  a  more  computationally  efficient  matrix  multiplication  routine,  also 
improves  the  execution  time  of  this  step  beyond  what  we  projected  with  the  estimated 
matrix  multiply  throughput. 

•  A  new  algorithm  mode  has  been  adopted  as  the  default  mode.  In  the  new,  default  mode,  the 
DR  performs  4K-point  FFTs,  instead  of  8K-point  FFTs,  in  the  initial  computational  stage. 
This  change,  along  with  the  change  in  the  size  of  the  overlap  between  successive  data 
blocks,  allows  the  time  interval  for  each  data  block  to  exactly  match  the  AGC  loop  time. 
Furthermore,  the  number  of  beam  bands  being  formed  has  been  reduced  from  76  to  14. 
These  changes  significantly  increase  the  amount  of  spare  processing  throughput  (from  62% 
to  220%). 

Our  analysis  indicates  that  the  DR  will  be  able  to  handle  the  BSAR  algorithm  in  both  the  original 
and the new, default modes with sufficient spare processor capacity. Furthermore, there is sufficient
memory for the various input, intermediate, and output data products, although some of the memory margin will
be  consumed  in  the  original  algorithm  mode. 
1. INTRODUCTION


The  goals  of  the  CBASS  (Common  Broadband  Advanced  Sonar  System)  program  are  centered 
around  the  upgrade  of  the  Mk  48  heavyweight  torpedo’s  sensors  and  signal  processors.  The  Mk  48  Mod  7 
torpedo  will  use  broadband  signals  and  signal  processing  techniques  to  improve  its  ability  to  track  a  target 
in  a  littoral  environment  and  in  the  presence  of  countermeasures  while  maintaining  its  effectiveness  in 
deep  water.  The  torpedo’s  use  in  a  littoral  setting  will  expose  it  to  interference  from  civilian  shipping  as 
well  as  greatly  increased  reverberation  noise  from  the  ocean  floor  and  surface. 

To  support  the  use  of  a  wider  bandwidth,  the  Mod  7  torpedo  will  have  a  BSAR  (broadband  sonar 
analog receiver) that will perform initial filtering, decimation, and conventional beamforming of the input
data  from  the  receivers.  The  DR  (digital  receiver)  portion  of  the  BSAR  will  perform  these  operations  on  a 
custom-built  embedded  processor.  The  design  of  the  DR  is  based  on  12  ADSP-21160  “Hammerhead” 
DSPs  (digital  signal  processors),  with  an  aggregate  peak  processing  throughput  of  5.76  Gflop/s  (billion 
floating-point  operations  per  second). 

MIT  Lincoln  Laboratory’s  role  in  this  aspect  of  the  CBASS  program  was  to  evaluate  the  DR  and  its 
ability  to  perform  the  BSAR  signal  processing  in  real-time.  To  this  end,  we  have  developed  a  processor 
assessment methodology in which we consider several candidate mappings of the signal processing
algorithm onto the processor architecture and progressively refine our estimate of the processor performance.

In  the  original  version  of  this  report,  we  analyzed  the  computational  workload  associated  with  the 
BSAR  algorithm,  detailed  several  candidate  mappings,  and  estimated  the  performance  of  the  DR  using 
each  mapping.  We  also  analyzed  the  sensitivity  of  the  estimated  performance  to  the  assumptions  we  made 
about the efficiency of the DR on various computation and communication kernels. We have since
measured the efficiencies of these kernels on a commercially available board whose architecture closely
resembles that of the DR, and, for this revision, have incorporated these performance numbers in our analysis to
generate a more accurate estimated execution time.

Another  significant  change  since  the  original  version  of  this  report  is  the  fact  that  a  second  algorithm 
mode  for  the  BSAR  was  added.  In  the  original  algorithm  mode,  the  BSAR  collected  data  for  64.3  ms  (with 
a  nominal  100  kHz  sampling  rate)  before  processing  was  begun.  This  relatively  long  cycle  time  not  only 
resulted  in  a  slow  update  rate  for  the  torpedo  guidance  loop,  it  also  was  not  a  whole  multiple  of  the  AGC 
(automatic  gain  control)  loop  cycle  time,  resulting  in  a  mismatch  between  the  AGC  loop  and  the  BSAR. 
The new algorithm mode collects data for 20.5 ms (with a nominal 100 kHz sampling rate) before
beginning processing. This cycle time also meshes perfectly with the AGC loop time. The analysis in this
revision will cover both the original and the new algorithm modes.

In this revision, we focus our analysis on the proposed algorithm mapping, which has been adopted
by the BSAR design teams. Our analysis indicates that, for the original algorithm mode, the DR will be
able to perform the BSAR signal processing in approximately 39.8 ms, which is within the 42.9 ms
available, and for the new algorithm mode, the DR will be able to perform the signal processing in
approximately 6.4 ms, which is within the 13.7 ms available. The other mappings, which were designed to address
shortcomings of the selected mapping but were found to result in worse performance, are not considered in
this revision.
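The margins quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the BSAR software: it assumes the 50% spare throughput requirement described later in this report (spare throughput at least 50% of the throughput consumed) and uses the rounded block times and execution times quoted in this report. The function names are hypothetical.

```python
def available_time_ms(block_time_ms, spare_fraction=0.5):
    """Time allowed for computation when the spare throughput must be at
    least `spare_fraction` of the throughput consumed."""
    return block_time_ms / (1.0 + spare_fraction)


def spare_throughput(block_time_ms, exec_time_ms):
    """Spare throughput expressed as a fraction of the throughput consumed."""
    return (block_time_ms - exec_time_ms) / exec_time_ms


# (data block time, estimated execution time) in ms, as quoted in this report
modes = {"original": (64.3, 39.8), "default": (20.5, 6.4)}

for name, (block_ms, exec_ms) in modes.items():
    print(
        name,
        round(available_time_ms(block_ms), 1),            # 42.9 and 13.7 ms
        round(100 * spare_throughput(block_ms, exec_ms)),  # 62 and 220 percent
    )
```

Under these assumptions the sketch reproduces the 42.9 ms and 13.7 ms budgets and the 62% and 220% spare throughput figures.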
2. BSAR ARCHITECTURE


The Mk 48 Mod 7 heavyweight torpedo’s signal processing chain is made up of the following
elements (see Figure 1):

•  the  pre-amplifier 

•  the  ADCs  (analog  to  digital  converters) 

•  the  BSAR 

•  the  GCB  (guidance  and  control  box) 

The  BSAR  itself  consists  of: 

•  the  input  FPGA  (field-programmable  gate  array) 

•  the  DR 

•  the  core/control  processor  (not  shown) 

•  the  output  FPGA 


Figure 1: Mk 48 Mod 7 heavyweight torpedo signal processing chain


The  BSAR  is  responsible  for  filtering,  decimating,  and  beamforming  the  digitized  sonar  data  from 
the  ADCs.  This  signal  processing  is  performed  by  the  DR.  The  processed  data  are  then  sent  to  the  GCB  for 
further  processing  and  detection. 

The DR consists of 12 Hammerheads, each with a peak processing throughput of 480 Mflop/s
(million floating-point operations per second). The 12 Hammerheads are connected to the ADCs through an
input  FPGA,  which  also  converts  the  input  data  from  fixed-point  to  floating-point  representation.  The  DR 
communicates  the  processed  data  to  the  GCB  through  an  output  FPGA.  Finally,  a  Texas  Instruments 
TMS320C31  processor  serves  as  the  core  processor.  A  block  diagram  of  the  BSAR  is  illustrated  in 
Figure  2. 
Figure 2: BSAR block diagram (legend: memory bus; inter-processor link port connections; other link port connections; link to core processor)


The  Hammerheads  are  organized  into  three  clusters  of  four  Hammerheads  each.  Each  cluster  has: 

•  one  Hammerhead  connected  to  the  input  FPGA  via  a  link  port 

•  another  Hammerhead  connected  to  the  output  FPGA  via  a  link  port 

•  a  third  Hammerhead  connected  to  the  BES  (broadband  evaluation  system)  via  a  link  port 

•  the  last  Hammerhead  connected  to  the  interface  to  the  core  processor 

The Hammerheads are connected via link ports into a two-by-six torus. In addition to the
connections forming a torus, each Hammerhead is directly connected to the other three Hammerheads in its
cluster via a link port (see Figure 3).
The four Hammerheads in a cluster share a memory bus connecting them to 34 MB (megabyte,
defined as 1,000,000 bytes) of SDRAM (synchronous dynamic random-access memory)¹ and 8 MB of
boot PROM (programmable read-only memory). The various elements of the DR will be described in
greater detail below.

2.1  ADSP-21160  HAMMERHEAD 

The  Hammerhead  processor  consists  of  (see  Figure  4): 

•  a  core  processor 

•  524 KB (kilobyte, defined as 1,000 bytes) of on-chip SRAM² (static random-access memory)

•  an  independent  I/O  (input/output)  processor 

•  an  external  port 


1. Given our definition of 1 MB = 1,000,000 bytes, 2^25 bytes = 33.6 MB. Sources that define 1 MB = 2^20
bytes will arrive at an SDRAM capacity of 32 MB.

2. The on-chip capacity of the Hammerhead is 2^19, or 524,288, bytes. Because we define KB as 1,000 bytes,
the on-chip memory capacity of the Hammerhead is 524 KB.
Figure 4: ADSP-21160 Hammerhead internal architecture


These components are connected via two independent address buses and two independent data
buses.


2.1.1  Core  Processor 

The  key  feature  of  the  core  processor  is  the  presence  of  two  ALUs  (arithmetic  and  logic  units).  These 
two  ALUs  must  operate  in  SIMD  (single  instruction  stream,  multiple  data  streams)  fashion:  in  any  given 
clock  cycle,  the  only  instruction  the  second  ALU  is  allowed  to  execute  is  the  instruction  the  first  ALU  is 
executing. 

All  instructions  in  the  Hammerhead  instruction  set  can  be  executed  in  a  single  clock  cycle.  Among 
them  is  an  instruction  that  simultaneously  computes  the  sum,  difference,  and  product  of  two  floating-point 
operands,  for  a  total  of  three  floating-point  operations  in  one  clock  cycle.  This  ability  is  very  useful  when 
performing  FFTs  (fast  Fourier  transforms),  where  the  basic  FFT  butterfly  requires  these  three  operations. 
Combining  the  two  ALUs  with  the  ability  to  perform  three  floating-point  operations  in  a  single  cycle  gives 
us  the  peak  throughput  of  a  single  80  MHz  Hammerhead:  480  Mflop/s. 
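The peak throughput figures follow directly from the numbers in this section. As a quick illustrative check (a sketch only; the constant names are hypothetical):

```python
# Peak throughput of one Hammerhead and of the 12-Hammerhead DR,
# using the figures quoted in this report.
ALUS_PER_DSP = 2               # two ALUs operating in SIMD fashion
FLOP_PER_ALU_PER_CYCLE = 3     # sum, difference, and product in one instruction
CLOCK_HZ = 80_000_000          # 80 MHz processor clock
DSPS_IN_DR = 12

peak_per_dsp = ALUS_PER_DSP * FLOP_PER_ALU_PER_CYCLE * CLOCK_HZ
peak_dr = peak_per_dsp * DSPS_IN_DR

print(peak_per_dsp / 1e6, "Mflop/s per Hammerhead")  # 480.0
print(peak_dr / 1e9, "Gflop/s for the DR")           # 5.76
```

These are peak values; they are reached only when every cycle issues the three-operation instruction on both ALUs.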
2.1.2 On-Chip SRAM


The Hammerhead has 524 KB of on-chip SRAM, arranged in two banks. This SRAM is dual-ported,
making it accessible by both the core processor and the I/O processor simultaneously. Furthermore, this
on-chip memory is truly random access, in that any location may be accessed with no wait states. This
ability is useful during local corner turn operations, when on-chip data must be accessed in strided fashion.

For  best  performance,  one  bank  of  SRAM  should  be  reserved  for  code,  with  the  data  residing  in  the 
other bank. This arrangement allows the Hammerhead to retrieve code and data from SRAM
simultaneously. Furthermore, code currently being executed on the Hammerhead and data currently being used
should  be  resident  in  the  on-chip  SRAM:  the  performance  is  heavily  penalized  if  code  and  data  must  be 
retrieved  from  off-chip  DRAM  (dynamic  random  access  memory). 

2.1.3  I/O  Processor 

The I/O processor provides background DMA (direct memory access) capabilities to the
Hammerhead. It can support the following transfer types:

•  between  the  on-chip  SRAM  and  external  memory  or  external  peripherals 

•  between  the  on-chip  SRAM  and  the  internal  memory  of  other  DSPs 

•  between  the  on-chip  SRAM  and  a  host  processor 

•  between  the  on-chip  SRAM  and  the  serial  ports 

•  between  the  on-chip  SRAM  and  the  link  ports 

•  between  external  memory  and  external  peripherals 

Once  it  has  received  a  DMA  command  from  the  core  processor,  it  can  operate  completely  in  the 
background  without  additional  core  processor  support.  This  feature  is  useful  in  hiding  memory  accesses 
behind  computations. 

The  six  link  ports,  which  are  part  of  the  I/O  processor,  are  8  bits  wide  and  were  originally  designed 
to  be  clocked  at  the  processor  clock  rate.  However,  currently  available  Hammerheads  have  link  ports 
clocked  at  half  the  processor  clock  rate,  or  40  MHz. 

2.1.4  External  Port 

The  external  port  is  used  primarily  to  connect  to  external  DRAM.  The  external  port’s  address  bus  is 
32  bits  wide,  while  the  data  bus  is  64  bits  wide.  In  addition  to  this  interface,  the  external  port  also  has  a 
multiprocessor  interface  and  a  host  port. 

2.2  QUAD-HAMMERHEAD  CLUSTER 

The  Hammerheads  in  the  DR  are  grouped  into  three  clusters  of  four  Hammerheads  (see  Figure  5  for 
a  diagram  of  a  single  cluster). 
As indicated before, each Hammerhead cluster has connections to the input FPGA, the output
FPGA, the host, and the BES.

Each cluster has 34 MB of SDRAM, which is accessed via the shared memory bus.

2.3  SHARED  MEMORY  BUS 

The four Hammerheads in a cluster share a memory bus connecting to the common external
memory. The memory bus is 64 bits wide, and is clocked at half the Hammerhead clock rate. With a nominal 40
MHz clock rate, the shared bus’s peak bandwidth is 320 MB/s.

In practice, the sustained bandwidths are lower. For reads from DRAM, there are three cycles of
setup for every four cycles of data access, or four 64-bit words per seven cycles. Therefore, the best
sustained bandwidth for reads from DRAM is four-sevenths of 320 MB/s, or 176 MB/s. For writes to DRAM,
there is one cycle of setup for every four cycles of data access, or four 64-bit words per five cycles.
Therefore, the best sustained bandwidth for writes to DRAM is four-fifths of 320 MB/s, or 256 MB/s.
Benchmark results from Bittware show that the actual sustained bandwidth is closer to 240 MB/s.
3. FREQUENCY-DOMAIN DR ALGORITHM DESCRIPTION


The  BSAR  signal  processing  algorithm  performs  the  following  functions: 

•  FIR  (finite  impulse  response)  filtering 

•  beamforming 

•  subband  selection 

•  alias  correction 

•  phase  correction 

FIR filtering is performed in the frequency domain, and employs the overlap and save filtering
method to avoid the computational burden of one long FFT for all the samples and to allow the data to be
processed before they are all available. Beamforming and subband selection are also performed in the
frequency domain, and the beamforming coefficients are combined with the FIR filter coefficients for
computational expediency.

Alias  correction  is  performed  when  the  CBASS  subband  selection  (frequency  filter)  response  is 
greater than the detection bandwidth. When the BSAR is emulating ADCAP (ADvanced CAPability)
processing, FIR filtering followed by decimation in the time domain is equivalent to applying the corresponding
filter in the frequency domain (i.e., fast convolution) followed by a frequency folding operation.

Phase correction is necessary to compensate for the overlap and save filtering process and the
basebanding oscillator signal.

3.1  ORIGINAL  ALGORITHM  MODE 

3.1.1  Overlap  and  Save  Filtering  FFT 

The input data are converted into the frequency domain for filter application. Instead of performing
one long FFT on the entire sample extent, which would require a large number of operations and a large
quantity of memory as well as waiting for all the data to be available before starting, the DR performs
shorter FFTs on subsets of the data [1]. Nominally, for the original algorithm mode, 6,432 new samples and
1,760 overlap samples from the previous data block (or zeros if we are processing the first block of data)
are transformed using an 8K FFT (see Figure 6). Because the input data are real, a real FFT is used.
Figure 6: Overlap and save filtering FFT: original algorithm mode (8,192 samples total: 1,760 overlap samples, or zero padding for the first data block, plus 6,432 new samples)
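The block assembly described above can be sketched in a few lines. This is an illustration under stated assumptions, not the DR implementation: the function name is hypothetical, the input is a plain Python list for one channel, and the overlap samples are placed at the front of each block (consistent with the later step that discards samples from the front of the IFFT output).

```python
def overlap_save_blocks(samples, n_fft, n_overlap):
    """Yield FFT input blocks of n_fft samples for one channel.

    Each block holds n_overlap samples carried over from the previous
    block (zeros for the first block) followed by n_fft - n_overlap new
    samples. Trailing samples that do not fill a block are left unread.
    """
    n_new = n_fft - n_overlap
    carry = [0.0] * n_overlap          # zero padding for the first block
    for start in range(0, len(samples) - n_new + 1, n_new):
        block = carry + samples[start:start + n_new]
        yield block
        # the last n_overlap samples become the next block's overlap
        carry = block[-n_overlap:] if n_overlap else []


# Original algorithm mode: 8K FFT with 1,760 overlap samples
stream = [float(n) for n in range(1, 2 * 6432 + 1)]
blocks = list(overlap_save_blocks(stream, n_fft=8192, n_overlap=1760))
print(len(blocks), len(blocks[0]))
```

Each yielded block would then be passed to a real FFT of length `n_fft`.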


3.1.2  Subband  Selection  and  Combined  Filter  Application/Beamforming 

Once the data are transformed into the frequency domain, frequency bins corresponding to the
subbands of interest are retained. Nominally, for the original algorithm mode, 278 frequency bins from four
subbands are retained.

Within each subband, nominally 19 beams are formed for the original algorithm mode. The
beamspace output $b_{i,j}$ for beam $i$ at frequency bin $j$ is computed using the following equation:

$$ b_{i,j} = \sum_{k=1}^{N_{\mathrm{channels}}} w_{i,j,k}\, c_{j,k} \tag{1} $$

where the number of channels $N_{\mathrm{channels}}$ is nominally 52, $c_{j,k}$ is the element-space data for frequency bin $j$ and
channel $k$, and $w_{i,j,k}$ is the beamforming coefficient for beam $i$, frequency bin $j$, and channel $k$.


To reduce the amount of data to be transferred into the DR and the amount of computation to be
performed during the real-time receive cycle, the frequency-domain filtering coefficients are combined with
the beamforming weights prior to their application to the element-space data.
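The per-bin beamforming sum of equation (1) amounts to one dot product per beam. The sketch below is illustrative only: the function name and data layout are assumptions, and it applies the combined coefficients as a plain complex product; whether the actual coefficients are stored conjugated is not stated in this report.

```python
def beamform_bin(weights, data):
    """Form all beams for a single frequency bin.

    weights[i][k] is the combined filter/beamforming coefficient for
    beam i and channel k at this bin; data[k] is the element-space
    value for channel k at this bin. Returns one output per beam.
    """
    return [sum(w * c for w, c in zip(beam_weights, data))
            for beam_weights in weights]


# Toy example with two beams and two channels (the nominal case has
# 19 beams and 52 channels per frequency bin).
weights = [[1 + 0j, 1j],
           [0.5, 0.5]]
data = [2 + 0j, 1 - 1j]
print(beamform_bin(weights, data))  # [(3+1j), (1.5-0.5j)]
```

Each complex multiply-accumulate in the inner sum costs one complex multiply and one complex add.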

3.1.3  Alias  Correction 

If  the  bandwidth  of  the  filter  exceeds  the  detection  bandwidth,  the  band  edges  must  be  aliased  in  the 
frequency  domain  so  the  resultant  bandwidth  equals  the  detection  bandwidth.  This  processing  is  equivalent 
to  the  time  domain  implementation  used  in  the  ADCAP  front  end  processor  (FIR  filtering  followed  by 
downsampling). 

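Let $N_{\mathrm{bins}}$ be the number of frequency bins per subband, and let $N_{\mathrm{IFFT}}$ be the length of the IFFT
(inverse FFT) to be performed in the next stage, which is equal to the detection bandwidth. The number of
frequency bins on each end to be folded is equal to

$$ N_{\mathrm{fold}} = \frac{N_{\mathrm{bins}} - N_{\mathrm{IFFT}}}{2} \tag{2} $$

Let $F$ be the pre-aliasing sequence, with $N_{\mathrm{bins}}$ frequency bins, and let $G$ be the post-aliasing
sequence, with $N_{\mathrm{IFFT}}$ frequency bins. Then

$$ G(1 : N_{\mathrm{fold}}) = F(N_{\mathrm{fold}} + 1 : 2N_{\mathrm{fold}}) + F(N_{\mathrm{bins}} - N_{\mathrm{fold}} + 1 : N_{\mathrm{bins}}) \tag{3} $$

$$ G(N_{\mathrm{fold}} + 1 : N_{\mathrm{IFFT}} - N_{\mathrm{fold}}) = F(2N_{\mathrm{fold}} + 1 : N_{\mathrm{bins}} - 2N_{\mathrm{fold}}) \tag{4} $$

$$ G(N_{\mathrm{IFFT}} - N_{\mathrm{fold}} + 1 : N_{\mathrm{IFFT}}) = F(1 : N_{\mathrm{fold}}) + F(N_{\mathrm{bins}} - 2N_{\mathrm{fold}} + 1 : N_{\mathrm{bins}} - N_{\mathrm{fold}}) \tag{5} $$

(see Figure 7 below).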

(see  Figure  7  below). 


Figure  7:  Alias  correction 


In the original algorithm mode, $N_{\mathrm{IFFT}}$ is nominally 256, while $N_{\mathrm{bins}}$ is nominally 278.
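The folding operation can be illustrated with a short sketch. It assumes the interpretation that each band edge of $N_{\mathrm{fold}}$ bins wraps around by $N_{\mathrm{IFFT}}$ bins onto the opposite end of the retained band, so that the output has $N_{\mathrm{IFFT}}$ bins; the function name is hypothetical and indices are zero-based.

```python
def alias_fold(f, n_ifft):
    """Fold the band edges of sequence f so the result has n_ifft bins.

    The central n_ifft bins are kept; the n_fold bins above them wrap
    onto the start of the output and the n_fold bins below them wrap
    onto the end, where n_fold = (len(f) - n_ifft) / 2.
    """
    n_bins = len(f)
    extra = n_bins - n_ifft
    if extra < 0 or extra % 2 or extra > 2 * n_ifft:
        raise ValueError("unsupported combination of bin counts")
    n_fold = extra // 2
    g = list(f[n_fold:n_fold + n_ifft])           # central bins
    for m in range(n_fold):
        g[m] += f[n_bins - n_fold + m]            # upper edge to start
        g[n_ifft - n_fold + m] += f[m]            # lower edge to end
    return g


# Toy case: 8 bins folded to 4, so 2 bins fold from each end
print(alias_fold([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [10, 12, 6, 8]
```

Each folded bin costs one complex addition, which is the basis for the two-flop-per-bin estimate used in the workload analysis.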




3.1.4  Overlap  and  Save  Filtering  IFFT 


After  alias  correction,  the  frequency  bins  in  each  subband  are  transformed  into  the  time  domain  via 
an  IFFT.  After  the  IFFT,  those  samples  corresponding  to  the  overlapping  data  appended  during  the  FFT 
stage  are  removed  from  the  front  end  of  the  vector. 

If the number of overlap samples appended during the FFT stage is equal to $N_{\mathrm{overlap}}$, then the
number of samples to be discarded after the IFFT stage is equal to

$$ N_{\mathrm{discard}} = N_{\mathrm{overlap}} \times \frac{N_{\mathrm{IFFT}}}{N_{\mathrm{FFT}}} \tag{6} $$

where $N_{\mathrm{FFT}}$ is the length of the FFT in the overlap and save filtering FFT stage. Nominal values for these
parameters for the original algorithm mode are:

•  $N_{\mathrm{overlap}}$ = 1,760

•  $N_{\mathrm{FFT}}$ = 8,192

•  $N_{\mathrm{IFFT}}$ = 256

•  $N_{\mathrm{discard}}$ = 55


3.1.5  Phase  Correction 


Phase correction is needed to account for the band selection process in each data frame. Each sample
is multiplied by a complex scalar to perform phase correction.

3.2  DEFAULT  ALGORITHM 

3.2.1 Overlap and Save Filtering FFT

In  the  new,  default  algorithm  mode,  2,048  new  samples  and  2,048  overlap  samples  from  the  previous 
data  block  are  transformed  using  a  4K  FFT  (see  Figure  8). 


Figure 8: Overlap and save filtering FFT: default algorithm mode (4,096 samples total: 2,048 overlap samples, or zero padding for the first data block, plus 2,048 new samples)
3.2.2 Subband Selection and Combined Filter Application/Beamforming

Once the data are transformed into the frequency domain, frequency bins corresponding to the
subbands of interest are retained. In the default algorithm mode, nominally 140 frequency bins from one
subband are retained. Within this single subband, 14 (but possibly as many as 35) beams are formed.

As was done in the original algorithm mode, the frequency-domain filtering coefficients are
combined with the beamforming weights prior to their application to the element-space data to reduce the
amount  of  data  to  be  transferred  into  the  DR  and  the  amount  of  computation  to  be  performed  during  the 
real-time  receive  cycle. 

3.2.3  Alias  Correction 

In the default algorithm mode, alias correction is performed in the same fashion as it is performed in
the original algorithm mode. The only differences are in the parameters: in this mode, $N_{\mathrm{IFFT}}$ is nominally
128, and $N_{\mathrm{bins}}$ is nominally 140.

3.2.4  Overlap  and  Save  Filtering  IFFT 

The relevant nominal parameters for this stage in this mode are:

•  $N_{\mathrm{overlap}}$ = 2,048

•  $N_{\mathrm{FFT}}$ = 4,096

•  $N_{\mathrm{IFFT}}$ = 128

•  $N_{\mathrm{discard}}$ = 64

3.2.5  Phase  Correction 

Phase correction in this mode is performed identically to phase correction in the original algorithm
mode.
4. COMPUTATIONAL WORKLOAD


In  this  chapter,  we  derive  expressions  for  the  workload  for  the  BSAR  signal  processing  algorithm, 
and  give  the  computational  workload  for  the  nominal  parameters  for  both  algorithm  modes. 

As  indicated  before,  instead  of  waiting  for  all  of  the  input  samples  associated  with  a  single  listen 
interval to be collected, the data are divided into blocks that are processed as they become available to the
DR.  Therefore,  the  workload  described  below  is  performed  for  each  data  block. 

4.1  OVERLAP  AND  SAVE  FILTERING  FFT 

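The overlap and save filtering stage uses an $N_{\mathrm{FFT}}$-point FFT. With $N_{\mathrm{overlap}}$ samples of overlap
between two consecutive data blocks, the number of samples processed per data block is

$$ N_{\mathrm{data\ block\ size}} = N_{\mathrm{FFT}} - N_{\mathrm{overlap}} \tag{7} $$

Each $N_{\mathrm{FFT}}$-point real FFT requires $F_{\mathrm{FFT}}$ flop, where

$$ F_{\mathrm{FFT}} = 2.5\, N_{\mathrm{FFT}} \log_2 N_{\mathrm{FFT}} \quad \text{[2]} \tag{8} $$

Given that an $N_{\mathrm{FFT}}$-point FFT must be performed for each of the $N_{\mathrm{channels}}$ channels, the total flop count
for the FFT portion of overlap and save filtering is

$$ F_{\mathrm{overlap\ \&\ save\ FFT}} = F_{\mathrm{FFT}} \times N_{\mathrm{channels}} = 2.5\, N_{\mathrm{FFT}} \log_2 N_{\mathrm{FFT}} \times N_{\mathrm{channels}} \ \text{flop} \tag{9} $$

After the FFT, only the $N_{\mathrm{bins}}$ frequency bins of interest for each of the $N_{\mathrm{subbands}}$ frequency subbands are
retained for additional processing.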

4.2  COMBINED  FILTER  APPLICATION/BEAMFORMING 

Forming the filtered and beamformed output $A(i)$ for a single beam and a single frequency bin $i$
requires the evaluation of an $N_{\mathrm{channels}}$-point dot product between the combined subband filtering and
beamforming coefficients $\beta$ and the subbanded frequency-domain input data $S$. This single dot product
will require

$$ F_{\mathrm{dot\ product}} = 8\, N_{\mathrm{channels}} \ \text{flop} \tag{10} $$

Computing the filtered and beamformed output for all $N_{\mathrm{beams}}$ beams, all $N_{\mathrm{subbands}}$ subbands, and all
$N_{\mathrm{bins}}$ frequency bins will require

$$ F_{\mathrm{beamform}} = F_{\mathrm{dot\ product}} \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} \times N_{\mathrm{bins}} = 8\, N_{\mathrm{channels}} \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} \times N_{\mathrm{bins}} \ \text{flop} \tag{11} $$


4.3  ALIAS  CORRECTION 

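Alias correction is required only if the number of frequency bins in the subband $N_{\mathrm{bins}}$ exceeds the
number of points in the IFFT $N_{\mathrm{IFFT}}$. The number of frequency bins to be folded from the beginning of the
subband to the end of the subband is equal to

$$ N_{\mathrm{fold}} = \frac{N_{\mathrm{bins}} - N_{\mathrm{IFFT}}}{2} \tag{12} $$

The number of frequency bins to be folded from the end of the subband to the beginning of the subband is
also equal to $N_{\mathrm{fold}}$.

Folding a single frequency bin will require 2 flop. The number of flop necessary to fold all of the
aliased frequency bins and all $N_{\mathrm{beams}} \times N_{\mathrm{subbands}}$ beam bands is

$$ F_{\mathrm{alias}} = 2 \times 2N_{\mathrm{fold}} \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} = 2 \times (N_{\mathrm{bins}} - N_{\mathrm{IFFT}}) \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} \ \text{flop} \tag{13} $$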

4.4  OVERLAP  AND  SAVE  FILTERING  IFFT 

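Performing an $N_{\mathrm{IFFT}}$-point complex IFFT requires

$$ F_{\mathrm{IFFT}} = 5\, N_{\mathrm{IFFT}} \log_2 N_{\mathrm{IFFT}} \ \text{flop} \tag{14} $$

Given that an $N_{\mathrm{IFFT}}$-point IFFT must be performed for each of the $N_{\mathrm{beams}} \times N_{\mathrm{subbands}}$ beam bands, the
total flop count for the IFFT portion of overlap and save filtering is equal to

$$ F_{\mathrm{overlap\ \&\ save\ IFFT}} = F_{\mathrm{IFFT}} \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} = 5\, N_{\mathrm{IFFT}} \log_2 N_{\mathrm{IFFT}} \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} \ \text{flop} \tag{15} $$

After the IFFT, the first $N_{\mathrm{overlap}} \times (N_{\mathrm{IFFT}} / N_{\mathrm{FFT}})$ samples are discarded as part of the overlap and
save filtering process. The remaining $N_{\mathrm{post\ dec}}$ samples represent the post-decimation filtered and
beamformed time-domain data, where

$$ N_{\mathrm{post\ dec}} = (N_{\mathrm{FFT}} - N_{\mathrm{overlap}}) \times \frac{N_{\mathrm{IFFT}}}{N_{\mathrm{FFT}}} \tag{16} $$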
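The two sample counts that follow the IFFT can be evaluated directly from the nominal parameters. This is a small illustrative sketch with a hypothetical function name; it assumes the overlap decimates to a whole number of samples, as it does for both nominal parameter sets.

```python
def decimated_counts(n_fft, n_overlap, n_ifft):
    """Return (n_discard, n_post_dec) for one beam band after the IFFT.

    n_discard  = n_overlap * n_ifft / n_fft
    n_post_dec = (n_fft - n_overlap) * n_ifft / n_fft
    """
    if (n_overlap * n_ifft) % n_fft:
        raise ValueError("overlap does not decimate to a whole sample count")
    n_discard = n_overlap * n_ifft // n_fft
    n_post_dec = (n_fft - n_overlap) * n_ifft // n_fft
    return n_discard, n_post_dec


print(decimated_counts(8192, 1760, 256))  # original mode: (55, 201)
print(decimated_counts(4096, 2048, 128))  # default mode:  (64, 64)
```

In both cases the discarded and retained counts sum to the IFFT length, as expected.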
4.5 PHASE CORRECTION


Phase correction entails multiplying each sample in the filtered and beamformed output by the
complex phase correction factor. Both m and y can be computed outside the real-time loop, so phase
correction requires 6 flop for a complex multiply for each sample in the output. The total number of flop
needed to correct the phase for all samples and all $N_{\mathrm{beams}} \times N_{\mathrm{subbands}}$ beam bands is

$$ F_{\mathrm{phase\ correction}} = 6\, N_{\mathrm{post\ dec}} \times N_{\mathrm{beams}} \times N_{\mathrm{subbands}} \ \text{flop} \tag{17} $$

These  computations  may  be  subsumed  into  the  previous  IFFT  stage  by  combining  the  IFFT  weights 
with  the  phase  correction  term. 
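The per-stage expressions in this chapter can be collected into one short function. The sketch below is an illustration rather than project code: the function and key names are hypothetical, and the per-operation costs (8 flop per complex multiply-accumulate, 2 flop per folded bin, 6 flop per complex multiply) are the ones used in the expressions above. With the nominal parameter sets it reproduces the per-block totals of 23,506,968 flop (original mode) and 7,273,552 flop (default mode).

```python
def log2_exact(n):
    """Return log2(n) as an integer; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    return n.bit_length() - 1


def flop_counts(n_channels, n_fft, n_overlap, n_bins, n_ifft,
                n_beams, n_subbands):
    """Per-data-block flop count of each BSAR processing stage."""
    beam_bands = n_beams * n_subbands
    n_post_dec = (n_fft - n_overlap) * n_ifft // n_fft
    counts = {
        # 2.5 N log2(N) per real FFT, one FFT per channel
        "overlap and save FFT":
            5 * n_fft * log2_exact(n_fft) // 2 * n_channels,
        # 8 flop per channel per beam, subband, and frequency bin
        "filter and beamform": 8 * n_channels * beam_bands * n_bins,
        # 2 flop per folded bin, (n_bins - n_ifft) folded bins per beam band
        "alias correction": 2 * max(n_bins - n_ifft, 0) * beam_bands,
        # 5 N log2(N) per complex IFFT, one IFFT per beam band
        "overlap and save IFFT":
            5 * n_ifft * log2_exact(n_ifft) * beam_bands,
        # 6 flop per retained output sample
        "phase correction": 6 * n_post_dec * beam_bands,
    }
    counts["total"] = sum(counts.values())
    return counts


original = flop_counts(n_channels=52, n_fft=8192, n_overlap=1760,
                       n_bins=278, n_ifft=256, n_beams=19, n_subbands=4)
default = flop_counts(n_channels=52, n_fft=4096, n_overlap=2048,
                      n_bins=140, n_ifft=128, n_beams=14, n_subbands=1)
print(original["total"], default["total"])  # 23506968 7273552
```

Integer arithmetic is used throughout so that the results can be compared exactly against tabulated flop counts.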

4.6  COMPUTATIONAL  WORKLOAD  ANALYSIS 

4.6.1  Original  Algorithm  Mode 

The nominal parameters for the original algorithm mode are given below in Table 1.

Table 1: Original Algorithm Mode Nominal Parameters

Parameter                               Variable Name   Nominal Value
--------------------------------------  --------------  -------------
number of channels                      N_channels      52
FFT length (samples)                    N_FFT           8,192
FFT overlap length (samples)            N_overlap       1,760
number of frequency bins in a subband   N_bins          278
IFFT length (samples)                   N_IFFT          256
number of beams                         N_beams         19
number of subbands                      N_subbands      4
The derived nominal parameters for the original algorithm mode are given below in Table 2.

Table 2: Original Algorithm Mode Derived Nominal Parameters

Parameter                            Variable Name       Nominal Value
-----------------------------------  ------------------  -------------
data block size (samples)            N_data block size   6,432
number of folded samples             N_fold              11
number of post-decimation samples    N_post dec          201

The flop count for the original algorithm mode is given below in Table 3.

Table 3: Original Algorithm Mode Flop Count

Stage                   Variable Name           Flop Count    Fraction of Total
----------------------  ----------------------  ------------  -----------------
overlap and save FFT    F_overlap & save FFT    13,844,480    58.9%
filter and beamform     F_beamform              8,789,248     37.4%
alias correction        F_alias                 3,344         0.0%
overlap and save IFFT   F_overlap & save IFFT   778,240       3.3%
phase correction        F_phase correction      91,656        0.4%
Total                                           23,506,968    100.0%
4.6.2 Default Algorithm Mode

The nominal parameters for the default algorithm mode are given below in Table 4.

Table 4: Default Algorithm Mode Nominal Parameters

Parameter                               Variable Name   Nominal Value
--------------------------------------  --------------  -------------
number of channels                      N_channels      52
FFT length (samples)                    N_FFT           4,096
FFT overlap length (samples)            N_overlap       2,048
number of frequency bins in a subband   N_bins          140
IFFT length (samples)                   N_IFFT          128
number of beams                         N_beams         14
number of subbands                      N_subbands      1

The derived nominal parameters for the default algorithm mode are given below in Table 5.

Table 5: Default Algorithm Mode Derived Nominal Parameters

Parameter                            Variable Name       Nominal Value
-----------------------------------  ------------------  -------------
data block size (samples)            N_data block size   2,048
number of folded samples             N_fold              6
number of post-decimation samples    N_post dec          64


The flop count for the default algorithm mode is given below in Table 6.

Table 6: Default Algorithm Mode Flop Count

Stage                   Variable Name           Flop Count    Fraction of Total
----------------------  ----------------------  ------------  -----------------
overlap and save FFT    F_overlap & save FFT    6,389,760     87.8%
filter and beamform     F_beamform              815,360       11.2%
alias correction        F_alias                 336           0.0%
overlap and save IFFT   F_overlap & save IFFT   62,720        0.9%
phase correction        F_phase correction      5,376         0.1%
Total                                           7,273,552     100.0%

4.7  THROUGHPUT  REQUIREMENT 

4.7.1  Original  Algorithm  Mode 

In  an  active  sonar  time  line,  the  torpedo  transmits  a  sonar  waveform  (nominally  264  ms  in  duration), 
then  receives  sonar  data  for  several  seconds  before  beginning  the  next  sonar  cycle.  Instead  of  collecting 
input  data  for  several  seconds,  then  processing  them  all  at  once,  the  data  are  divided  into  data  blocks  and 
processed  as  they  arrive  at  the  DR  (see  Figure  9).  The  combined  beamforming  and  filtering  coefficients 
may  be  computed  during  the  transmit  time,  before  the  end  of  which  input  data  cannot  be  processed. 


[Time line diagram: transmit (264 ms), then receive (several seconds), with the receive interval
divided into data blocks]

Figure 9: Time line for an active sonar cycle

To compute the sustained computational throughput requirement, the total time available to perform
the BSAR computational workload is needed. We shall use as the time available the amount of time
necessary to collect $N_{\text{data block size}}$ samples at the nominal sampling frequency of
$f_{\text{sample}}$:

$$t_{\text{sample}} = \frac{N_{\text{data block size}}}{f_{\text{sample}}} \tag{18}$$

With the nominal parameters for the original algorithm mode, $t_{\text{sample}} = 64.3$ ms.

The  CBASS  program  places  a  50%  spare  processor  throughput  requirement  on  the  BSAR  to  ensure 
sufficient  margins  and  to  avoid  cost  and  schedule  overruns:  the  amount  of  spare  throughput  must  be  at  least 
50%  of  the  throughput  consumed.  Given  that  the  64.3  ms  total  time  for  the  original  algorithm  mode  needs 
to  be  150%  of  the  time  available  for  computation,  the  real-time  execution  time  is  42.9  ms.  Based  on  this 
reduced  available  time,  the  sustained  throughput  requirement  for  the  original  algorithm  mode  is  548.2 
Mflop/s. 
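
The arithmetic behind these figures can be checked with a short sketch. It assumes the 6,432-sample data block and 100 kHz sampling rate quoted elsewhere in this report for the original algorithm mode; the function name and its `spare_fraction` parameter are illustrative, not part of the BSAR design.

```python
def sustained_requirement(flop_count, block_samples, sample_rate_hz, spare_fraction=0.5):
    """Return (t_sample, time available, required flop/s) for one data block."""
    t_sample = block_samples / sample_rate_hz        # Equation 18
    t_available = t_sample / (1.0 + spare_fraction)  # t_sample is 150% of this
    return t_sample, t_available, flop_count / t_available


# Original algorithm mode: 23,506,968 flop per 6,432-sample data block.
t_sample, t_available, required = sustained_requirement(23_506_968, 6_432, 100e3)
print(f"t_sample    = {t_sample * 1e3:.1f} ms")
print(f"t_available = {t_available * 1e3:.1f} ms")
print(f"requirement = {required / 1e6:.1f} Mflop/s")
```

Calling the same function with the default algorithm mode's 7,273,552 flop and 2,048-sample data block gives that mode's requirement.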

As long as the number of data samples processed in a single data block,
$N_{\text{data block size}}$, remains constant, the throughput requirements will not change: shorter
or longer receive intervals will not affect our sustained throughput requirement. Of course, the
sustained throughput requirement is proportional to the sampling frequency $f_{\text{sample}}$.


4.7.2  Default  Algorithm  Mode 

The nominal parameters for the default algorithm mode result in a $t_{\text{sample}}$ of 20.5 ms.
Because this time is 150% of the time available for computation, the real-time execution time is
13.7 ms. Based on this reduced available time, the sustained throughput requirement for the default
algorithm mode is 532.7 Mflop/s.
5. PROCESSOR ASSESSMENT METHODOLOGY


5.1  MOTIVATION 


Based on the measured sustained throughput of the Hammerhead on the types of computations performed
in the BSAR algorithm, and the amount of computational work the algorithm entails, we may compute
an estimated execution time.

In the original algorithm mode, the overlap and save FFT stage requires 13.8 Mflop. The peak
throughput of an 80 MHz Hammerhead on FFTs is 480 Mflop/s. With 12 Hammerheads total and a measured
efficiency of 65% on real 8K FFTs, we estimate the sustained throughput of the DR to be 3.74
Gflop/s. Given these numbers, we would expect the overlap and save FFT stage to take

$$13{,}844{,}480\ \text{flop} \div \frac{3.74 \times 10^{9}\ \text{flop}}{\text{second}} = 3.70 \times 10^{-3}\ \text{s} \tag{19}$$


We define the efficiency metric as the ratio of the sustained throughput, or the computational
performance we actually achieved, to the peak throughput, or the theoretical best computational
performance:

$$\text{efficiency} = \frac{\text{sustained throughput}}{\text{peak throughput}} \tag{20}$$


The beamforming/filtering stage in the original algorithm mode requires 8.8 Mflop. The peak
throughput of an 80 MHz Hammerhead on matrix computations is 320 Mflop/s. With 12 Hammerheads
total and a measured efficiency of 38%, we estimate the sustained throughput of the DR to be 1.46
Gflop/s. Given these numbers, we would expect the beamforming stage to take

$$8{,}789{,}248\ \text{flop} \div \frac{1.46 \times 10^{9}\ \text{flop}}{\text{second}} = 6.02 \times 10^{-3}\ \text{s} \tag{21}$$


We  can  use  this  process  to  compute  the  total  estimated  execution  time  for  the  original  algorithm 
mode  in  this  manner  (see  Table  7): 
Table 7: Naive Estimated Execution Time for the Original Algorithm Mode

Stage                    Flop Count    DR Sustained           Estimated Execution
                                       Throughput (Gflop/s)   Time (ms)
-----------------------  ------------  ---------------------  -------------------
overlap and save FFT       13,844,480  3.74                    3.70
filter and beamform         8,789,248  1.46                    6.02
alias correction                3,344  1.46                    0.00
overlap and save IFFT         778,240  3.51                    0.22
phase correction               91,656  1.46                    0.06
Total                      23,506,968  -                      10.01

This estimated execution time - 10.01 ms - is only a small fraction of the 42.9 ms we have available
to us. If we naively consider only the computational throughput requirements, then we might come to
the conclusion that the real-time deadlines can be easily met. Our methodology represents a more
sophisticated analysis that also considers the impact of communication and contention on the
execution time.
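
The naive estimate in Table 7 is simply each stage's flop count divided by the DR sustained throughput assumed for that stage. A minimal sketch that reproduces it from the table's own values (the variable names are illustrative):

```python
# (stage, flop count, DR sustained throughput in flop/s), taken from Table 7.
stages = [
    ("overlap and save FFT", 13_844_480, 3.74e9),
    ("filter and beamform", 8_789_248, 1.46e9),
    ("alias correction", 3_344, 1.46e9),
    ("overlap and save IFFT", 778_240, 3.51e9),
    ("phase correction", 91_656, 1.46e9),
]

# Compute-only estimate: no memory access, communication, or bus contention.
times_ms = {name: flop / rate * 1e3 for name, flop, rate in stages}
total_ms = sum(times_ms.values())

for name, t in times_ms.items():
    print(f"{name:<22} {t:6.2f} ms")
print(f"{'Total':<22} {total_ms:6.2f} ms")
```

Because the sketch ignores data movement entirely, it reflects only the optimistic bound that the remainder of the analysis refines.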

5.2  PROCESSOR  ASSESSMENT  METHODOLOGY 

In our methodology, we propose several mappings of the BSAR algorithm onto the DR architecture,
then estimate the performance of the DR using those mappings (see Figure 10). The estimates of the
execution time consider not only the time necessary to perform the computations but also the time
needed to move the requisite data into memory and to communicate data between processors. Based on
a preliminary execution time using estimated computation and communication efficiencies, we winnow
down the field of candidate mappings to the most promising ones. For those remaining mappings, the
methodology calls for a refinement of our estimate of the execution time by replacing estimates of
the efficiencies with efficiencies measured using benchmarks. We also perform sensitivity analyses
to gauge the degree to which our execution time estimates vary if we vary our efficiency estimates.
Figure 10: Processor assessment methodology


In the original version of this report, we considered three candidate mappings and estimated their
performance based on published and estimated efficiencies. We also reported the results of a
sensitivity analysis performed on the mapping that resulted in the best estimated performance. In
this revision, we focus our analysis on the mapping that resulted in the best estimated performance
and has been adopted by the BSAR design teams.
6. PROPOSED MAPPING AND ESTIMATED TIMING


Because the processing on each quad-Hammerhead cluster is nearly identical, we will concentrate
our mapping discussions on a single quad-Hammerhead cluster.

The efficiencies given in this chapter are based on benchmark experiments performed on a Bittware
quad-Hammerhead board whose architecture closely resembles that of a DR quad-Hammerhead cluster.
Given these measured efficiencies, the proposed mapping results in a predicted execution time of
39.8 ms for the original algorithm mode, which is within the 42.9 ms available. For the default
algorithm mode, the predicted execution time is 6.4 ms, which is within the 13.7 ms available.

6.1  ORIGINAL  ALGORITHM  MODE 


6.1.1  Data  Input 


The 52 channels of data are distributed over the three clusters, with one cluster receiving 18
channels of data and the other two clusters receiving 17 channels of data each. As the input data
are real and not complex, each sample occupies four bytes. With 6,432 samples per data block, the
quantity of data that must be brought into a cluster in the worst case is 463 KB.

The  input  FPGA  needs  no  appreciable  amount  of  memory,  as  the  input  data  effectively  trickle  in  at 
the  nominal  sampling  rate  of  100  kHz.  Because  constant  accesses  to  the  shared  DRAM  to  save  the  data  one 
sample  at  a  time  would  prevent  efficient  accesses  to  DRAM  by  the  other  Hammerheads,  we  will  dedicate 
one  Hammerhead  -  the  Hammerhead  with  the  link  port  connected  to  the  input  FPGA  -  to  data  input.  This 
Hammerhead  will  accumulate  several  input  samples  in  SRAM  and  copy  them  into  DRAM  in  a  single 
transfer. 

To make the most efficient use of the Hammerhead's on-chip SRAM, half of the 524 KB will be
reserved for instructions, while the other 262 KB will be made available for data storage. To allow
the input Hammerhead to accept data from the input FPGA while simultaneously writing data to DRAM,
we will establish two buffers, each capable of storing 1/16th of the sample extent, or 402 samples,
for all 18 channels. Each buffer will therefore be 29 KB in size.


The measured sustained bandwidth of the shared memory bus for writes to DRAM is 240 MB/s. At
this rate, the time to copy 1/16th of the data into DRAM is

$$28{,}944\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 121 \times 10^{-6}\ \text{s} \tag{22}$$

One such copy to DRAM will occur every 402 samples, or 4.02 × 10^-3 s. The total time spent copying
data into DRAM is 1,930 × 10^-6 s per data block.
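
The double-buffering figures above follow directly from the stated parameters. The sketch below reworks them; the constant names are illustrative, and the bus-occupancy ratio at the end is a derived quantity that the report does not state explicitly.

```python
SAMPLE_RATE_HZ = 100e3     # nominal sampling rate
BLOCK_SAMPLES = 6_432      # samples per data block (original mode)
CHANNELS = 18              # worst-case channels per cluster
BYTES_PER_SAMPLE = 4       # real-valued input samples
DRAM_WRITE_BW = 240e6      # measured write bandwidth, bytes/s
CHUNKS_PER_BLOCK = 16      # each buffer holds 1/16th of the sample extent

chunk_samples = BLOCK_SAMPLES // CHUNKS_PER_BLOCK            # 402 samples
buffer_bytes = chunk_samples * CHANNELS * BYTES_PER_SAMPLE   # one SRAM buffer
copy_time_s = buffer_bytes / DRAM_WRITE_BW                   # Equation 22
fill_time_s = chunk_samples / SAMPLE_RATE_HZ                 # time to fill a buffer
total_copy_s = CHUNKS_PER_BLOCK * copy_time_s                # per data block

# Fraction of the time the input Hammerhead occupies the shared bus (derived).
bus_occupancy = copy_time_s / fill_time_s

print(buffer_bytes, round(copy_time_s * 1e6), round(total_copy_s * 1e6))
print(f"bus occupancy: {bus_occupancy:.1%}")
```

Since each copy takes far less time than a buffer takes to fill, the second buffer is always free before the first one is full.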
6.1.2 Overlap and Save Filtering FFT

For each channel's worth of data, an 8K-point real FFT must be performed as the first step in the
overlap and save filtering. As one of the four Hammerheads in each cluster is dedicated to data
input, the signal processing computations will be performed by the remaining three Hammerheads.
Each Hammerhead's share of the data for this stage, in the worst case, is six channels.

On-Chip  Memory  Usage.  The  6,432  samples  of  new  data  plus  the  1,760  samples  of  old  overlap  data 
will  occupy  33  KB.  Even  though  only  33  KB  are  copied  from  DRAM  to  SRAM  for  a  single  vector,  the 
input  data  vector  will  occupy  twice  that  space  -  66  KB  -  because  the  vector  must  be  large  enough  to  store 
the  complex  results  of  the  in-place  FFT. 

To keep the processor busy performing FFTs, we will set up two separate buffers, each 66 KB in
size. While the processor is performing an FFT on the data in one buffer, the other buffer will
either be filled with more data from DRAM or copied back into DRAM. These two buffers together will
occupy 131 KB of the on-chip SRAM. We will also need to store in SRAM the FFT weights, which will
occupy 33 KB.
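
As a rough check that this buffering scheme fits in the data half of the on-chip SRAM, the sketch below totals the sizes quoted above. It assumes the 524 KB of SRAM is 524,288 bytes and takes the 33 KB FFT weights figure as 32,768 bytes; both byte counts are assumptions inferred from the rounded values in the text.

```python
SRAM_BYTES = 524_288               # assumed exact size of the "524 KB" SRAM
DATA_SRAM_BYTES = SRAM_BYTES // 2  # half reserved for data (262 KB)

FFT_POINTS = 8_192                 # 6,432 new + 1,760 overlap samples
BYTES_PER_REAL = 4

real_vector_bytes = FFT_POINTS * BYTES_PER_REAL   # 33 KB copied from DRAM
buffer_bytes = 2 * real_vector_bytes              # 66 KB: room for complex results
fft_weight_bytes = 32_768                         # assumed size of the 33 KB weights

used_bytes = 2 * buffer_bytes + fft_weight_bytes  # two buffers plus weights
free_bytes = DATA_SRAM_BYTES - used_bytes

print(f"used {used_bytes / 1e3:.0f} KB of {DATA_SRAM_BYTES / 1e3:.0f} KB, "
      f"{free_bytes / 1e3:.0f} KB free")
```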


Memory  Access:  DRAM  to  SRAM.  The  33  KB  of  input  data  must  be  copied  from  DRAM  to  SRAM 
over  the  shared  memory  bus.  The  measured  sustained  bandwidth  of  the  shared  memory  bus  for  reads  from 
DRAM  is  176  MB/s.  At  this  rate,  the  time  to  copy  the  input  data  from  DRAM  is 


$$32{,}768\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 186 \times 10^{-6}\ \text{s} \tag{23}$$


FFT  Processing.  A  single  8K-point  real  FFT  will  require  266,240  flop.  The  measured  sustained 
throughput  of  the  Hammerhead  on  a  real  8K  FFT  is  312  Mflop/s.  At  this  rate,  the  time  for  a  single  8K- 
point  real  FFT  is 


$$266{,}240\ \text{flop} \div \frac{312 \times 10^{6}\ \text{flop}}{\text{second}} = 853 \times 10^{-6}\ \text{s} \tag{24}$$


Memory Access: SRAM to DRAM. After the 8K-point FFT, we are only interested in 278 frequency
bins for each of four subbands, which together will occupy 9 KB. The measured sustained bandwidth
of the shared memory bus for writes to DRAM is 240 MB/s. At this rate, the time to copy the
frequency-domain data to DRAM is

$$8{,}896\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 37 \times 10^{-6}\ \text{s} \tag{25}$$


Simultaneous Operation of Three Hammerheads in a Cluster. The three Hammerheads performing
signal processing in a single Hammerhead cluster may perform FFTs simultaneously. However, because
they share the cluster memory bus, they must access DRAM sequentially. Therefore, the second
Hammerhead cannot begin loading its data from DRAM into its first buffer in SRAM until the first
Hammerhead has finished loading its data (see Figure 11).
Figure 11: Sequential access of DRAM during overlap and save FFT


Once  the  third  Hammerhead  has  finished  loading  its  data  from  DRAM  into  its  first  buffer  in  SRAM, 
the  first  Hammerhead  may  begin  loading  its  second  buffer  in  SRAM.  Because  the  Hammerhead  has  a  dual- 
ported  SRAM  and  an  I/O  processor  that  is  independent  from  the  ALU,  it  can  load  the  second  buffer  while 
still  processing  the  first  buffer.  Furthermore,  because  the  second  buffer  is  expected  to  be  fully  loaded 
before  processing  on  the  first  buffer  is  completed,  the  Hammerhead  can  begin  processing  the  second  buffer 
as  soon  as  it  has  completed  processing  the  first  buffer.  Copying  data  from  a  buffer  in  SRAM  to  DRAM  can 
be  done  concurrently  with  processing  as  well. 

The first Hammerhead cluster processes 18 channels, while the second and third clusters process 17
channels each. In the first Hammerhead cluster, all three Hammerheads process six channels. The
last processor to complete its processing will be the third Hammerhead in the first cluster (see
Figure 12).
Figure 12: FFT stage processing time line
Aside from the initial copy from DRAM into the first buffer in SRAM and the final copy from
SRAM to DRAM, all memory accesses during the FFT stage can be overlapped with computation. We can
now estimate the total execution time for this stage, which will be dictated by the third
Hammerhead in the first cluster:

• wait for the first Hammerhead to load an 8K-point real vector: 186 × 10^-6 s

• wait for the second Hammerhead to load an 8K-point real vector: 186 × 10^-6 s

• load an 8K-point real vector: 186 × 10^-6 s

• perform 8K-point real FFTs for six channels: 5,120 × 10^-6 s

• store four 278-point complex vectors: 37 × 10^-6 s

for a total of 5,716 × 10^-6 s.
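
The FFT stage time line can be sketched as a small model. It assumes, as the text states, that only the initial loads and the final store are exposed and that every other memory access hides behind computation; the function name and its `bus_position` parameter are illustrative.

```python
DRAM_READ_BW = 176e6    # bytes/s, measured
DRAM_WRITE_BW = 240e6   # bytes/s, measured
FFT_RATE = 312e6        # flop/s, measured on a real 8K FFT

load_s = 32_768 / DRAM_READ_BW    # Equation 23: one 8K-point real vector
fft_s = 266_240 / FFT_RATE        # Equation 24: one 8K-point real FFT
store_s = 8_896 / DRAM_WRITE_BW   # Equation 25: four 278-bin complex vectors


def fft_stage_time(bus_position, channels=6):
    """Completion time for the Hammerhead that is `bus_position`-th (1-based)
    in line for the shared memory bus."""
    # Wait for earlier Hammerheads' first loads, then do our own first load.
    exposed_loads = bus_position * load_s
    return exposed_loads + channels * fft_s + store_s


worst_case_s = fft_stage_time(bus_position=3)
print(f"{worst_case_s * 1e6:,.0f} x 10^-6 s")
```

Summing the rounded terms in the list gives 5,715; carrying the unrounded values, as this sketch does, gives the 5,716 quoted in the text.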

6.1.3  Beamforming  and  Filtering:  Step  1 

Each Hammerhead forms partial beamforming sums from five or six channels' worth of data, then
redistributes the data prior to completing the beamforming operation. Fundamentally, the
beamforming operation for a single beam and a single frequency bin involves the following
operation:

$$h = \sum_{i=1}^{52} c_i \times w_i \tag{26}$$

where $h$ is the beam-space output, $c_i$ is the element-space data for the $i$th channel, and
$w_i$ is the beamforming weight for the $i$th channel.

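This sum can be divided up by the channels local to each Hammerhead cluster:

$$h = \text{sum}_1 + \text{sum}_2 + \text{sum}_3 \tag{27}$$

$$\text{sum}_1 = \sum_{i=1}^{18} c_i \times w_i \tag{28}$$

$$\text{sum}_2 = \sum_{i=19}^{35} c_i \times w_i \tag{29}$$

$$\text{sum}_3 = \sum_{i=36}^{52} c_i \times w_i \tag{30}$$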


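Each of $\text{sum}_1$, $\text{sum}_2$, and $\text{sum}_3$ can be further subdivided by channels
assigned to each Hammerhead:

$$\text{sum}_1 = \sum_{i=1}^{6} c_i \times w_i + \sum_{i=7}^{12} c_i \times w_i + \sum_{i=13}^{18} c_i \times w_i \tag{31}$$

$$\text{sum}_2 = \sum_{i=19}^{24} c_i \times w_i + \sum_{i=25}^{30} c_i \times w_i + \sum_{i=31}^{35} c_i \times w_i \tag{32}$$

$$\text{sum}_3 = \sum_{i=36}^{41} c_i \times w_i + \sum_{i=42}^{47} c_i \times w_i + \sum_{i=48}^{52} c_i \times w_i \tag{33}$$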

To compute each of the nine sums in Equation 31 through Equation 33, we use the following mapping
proposed by NUWC (the Naval Undersea Warfare Center) for each Hammerhead:

• Copy the element-space data matrix for the five or six channels local to each processor and
all frequency bins from DRAM to SRAM.

• For each of the $N_{\text{beams}} \times N_{\text{subbands}}$ beam bands:

- Copy the $N_{\text{bins}}$-point vector with the beamforming/filtering weights for the first
channel from DRAM to SRAM.

- Repeat for all five or six channels: Multiply the weights vector with the corresponding
vector from the data matrix and add this product vector to the partial beamforming sum
vector while simultaneously copying the weights for the next channel from DRAM to
SRAM.

- Copy the $N_{\text{bins}}$-point vector with the partial sums for these five or six channels
from SRAM to DRAM.
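
The channel partitioning in Equation 31 through Equation 33 can be illustrated for a single beam and a single frequency bin. The sketch below uses randomly generated stand-in values for the element-space data and weights (not real sonar data) and confirms that combining the nine per-Hammerhead partial sums reproduces the direct 52-channel sum; all names are illustrative.

```python
import random

random.seed(0)
N_CHANNELS = 52

# Stand-in complex element-space data c_i and weights w_i for one beam, one bin.
c = [complex(random.random(), random.random()) for _ in range(N_CHANNELS)]
w = [complex(random.random(), random.random()) for _ in range(N_CHANNELS)]

# 1-based inclusive channel ranges per Hammerhead, grouped by cluster.
CLUSTER_RANGES = [
    [(1, 6), (7, 12), (13, 18)],     # cluster 1 (Equation 31)
    [(19, 24), (25, 30), (31, 35)],  # cluster 2 (Equation 32)
    [(36, 41), (42, 47), (48, 52)],  # cluster 3 (Equation 33)
]


def partial_sum(lo, hi):
    """Step 1: multiply-accumulate over one Hammerhead's local channels."""
    acc = 0j
    for i in range(lo, hi + 1):
        acc += c[i - 1] * w[i - 1]
    return acc


# Step 2: combine the three partial sums within each cluster.
cluster_sums = [sum(partial_sum(lo, hi) for lo, hi in ranges)
                for ranges in CLUSTER_RANGES]
# Step 3: combine the three cluster sums into the beam-space output.
h = sum(cluster_sums)

direct = sum(ci * wi for ci, wi in zip(c, w))
print(abs(h - direct) < 1e-9)
```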


On-Chip Memory Usage. The element-space data for six channels and 278 frequency bins will
occupy 13 KB. The beamforming/filtering weights vectors for two channels (we will use one weights
vector while loading the next weights vector) will occupy 4 KB. The partial beamforming sum vector
will require 2 KB. Together, these data products will require 20 KB.

Memory Access: DRAM to SRAM. The element-space data for six channels and 278 frequency bins
will occupy 13 KB. The measured sustained bandwidth of the shared memory bus for reads from DRAM is
176 MB/s. At this rate, the time to copy the element-space data from DRAM is

$$13{,}344\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 76 \times 10^{-6}\ \text{s} \tag{34}$$
The beamforming/filtering weights for one channel will occupy 2 KB. The time to copy the weights
from DRAM is

$$2{,}224\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 13 \times 10^{-6}\ \text{s} \tag{35}$$


Beamforming. Adding the partial beamforming sums from one channel for all 278 frequency bins
will require 2,224 flop. The measured sustained throughput of the Hammerhead on a
multiply-accumulate operation of this size is 121.6 Mflop/s. At this rate, the time for this
multiply-accumulate operation is

$$2{,}224\ \text{flop} \div \frac{121.6 \times 10^{6}\ \text{flop}}{\text{second}} = 18 \times 10^{-6}\ \text{s} \tag{36}$$


Memory Access: SRAM to DRAM. The partial beamforming sum vector will occupy 2 KB. The
measured sustained bandwidth of the shared memory bus for writes to DRAM is 240 MB/s. At this rate,
the time to copy the partial sum vector to DRAM is

$$2{,}224\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 9 \times 10^{-6}\ \text{s} \tag{37}$$


Simultaneous Operation of Three Hammerheads in a Cluster. The three Hammerheads performing
signal processing in a single Hammerhead cluster may perform matrix multiplication/accumulation
simultaneously. However, because they share the cluster memory bus, they must access DRAM
sequentially. Therefore, the second Hammerhead cannot begin loading its data from DRAM into SRAM
until the first Hammerhead has finished loading its data (see Figure 13).
Figure 13: Sequential access of DRAM during beamforming: step 1


Because the computation time for one channel (18 × 10^-6 s) is less than three times the time
needed to load one channel's weights from DRAM (3 × 13 × 10^-6 s, since three Hammerheads share the
memory bus), the total time for the first step of beamforming will be almost entirely limited by
the shared memory bus bandwidth. The last processor to complete its processing will be the third
Hammerhead in each cluster. We can now estimate the total execution time for the first step of
beamforming, which will be dictated by the third Hammerhead:

• wait for the first Hammerhead to load the element-space data for six channels: 76 × 10^-6 s

• wait for the second Hammerhead to load the element-space data for six channels: 76 × 10^-6 s

• load the element-space data for six channels: 76 × 10^-6 s

• for each beam band:

- load the beamforming/filtering weights for all six channels for all three Hammerheads:
13 × 10^-6 s × 3 × 6 = 227 × 10^-6 s

- perform multiply/accumulate for the sixth channel: 18 × 10^-6 s

- store partial beamforming sum in DRAM: 9 × 10^-6 s

for a total of 19,608 × 10^-6 s for 76 beam bands.
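
A minimal sketch of the step 1 estimate follows, assuming the bus-bound schedule described above (weight loads for three Hammerheads serialized on the shared bus, with only the last channel's multiply-accumulate exposed). Variable names are illustrative.

```python
DRAM_READ_BW = 176e6    # bytes/s, measured
DRAM_WRITE_BW = 240e6   # bytes/s, measured
MAC_RATE = 121.6e6      # flop/s, measured multiply-accumulate throughput

data_load_s = 13_344 / DRAM_READ_BW   # Equation 34: six channels x 278 bins
weight_load_s = 2_224 / DRAM_READ_BW  # Equation 35: one channel's weights
mac_s = 2_224 / MAC_RATE              # Equation 36: one channel, 278 bins
store_s = 2_224 / DRAM_WRITE_BW       # Equation 37: one partial sum vector

HAMMERHEADS = 3
CHANNELS = 6
BEAM_BANDS = 76

# The stage is bus-bound when compute hides behind the serialized weight loads.
bus_bound = mac_s < HAMMERHEADS * weight_load_s

per_beam_band_s = HAMMERHEADS * CHANNELS * weight_load_s + mac_s + store_s
step1_s = HAMMERHEADS * data_load_s + BEAM_BANDS * per_beam_band_s

print(bus_bound, f"{step1_s * 1e6:,.0f} x 10^-6 s")
```

Of the per-beam-band time, roughly 227 of about 255 microseconds is weight traffic, which is why the memory bus bandwidth rather than the processor throughput sets the step 1 time.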
6.1.4 Beamforming/Filter Application Step 2


In the second step of beamforming/filtering, we accumulate the three partial sums that were
computed within a single Hammerhead cluster (see Equation 31 through Equation 33). Specifically,
each Hammerhead loads from DRAM all three partial sums for one-third of the
$N_{\text{beams}} \times N_{\text{subbands}}$ beam bands, where each input partial sum is computed
from six channels' worth of element-space data, and combines them to form one set of output partial
sums for one-third of the beam bands.

Before  we  load  the  three  partial  sums  computed  during  the  first  step  of  beamforming/filtering,  we 
need  to  synchronize  the  three  Hammerheads  in  each  cluster  to  force  the  processors  to  wait  for  the  previous 
step  to  complete. 

On-Chip  Memory  Usage.  Because  of  the  memory  limitations  of  the  on-chip  SRAM,  we  will  process 
half  of  the  278  frequency  bins  at  a  time.  One-third  of  76  beam  bands,  rounded  up,  is  26  beam  bands.  The 
three  sets  of  input  partial  sums  for  26  beam  bands  and  139  frequency  bins  will  occupy  87  KB.  The  output 
partial  sums  for  26  beam  bands  and  139  frequency  bins  will  occupy  29  KB.  Together,  these  data  products 
will  require  116  KB. 

Memory  Access:  DRAM  to  SRAM.  The  three  sets  of  input  partial  sums  will  occupy  87  KB.  The 
measured  sustained  bandwidth  of  the  shared  memory  bus  for  reads  from  DRAM  is  176  MB/s.  At  this  rate, 
the  time  to  copy  the  input  partial  sums  from  DRAM  is 


$$86{,}736\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 493 \times 10^{-6}\ \text{s} \tag{38}$$


Beamforming.  Forming  the  cluster-wide  partial  sums  for  26  beam  bands  and  139  frequency  bins 
will  require  14,456  flop.  The  measured  sustained  throughput  of  the  Hammerhead  on  matrix  operations  of 
this  size  is  38  Mflop/s.  At  this  rate,  the  time  for  the  second  step  of  beamforming  and  filter  application  is 

$$14{,}456\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 376 \times 10^{-6}\ \text{s} \tag{39}$$


Memory Access: SRAM to DRAM. The cluster-wide partial sums for 26 beam bands and 139 frequency
bins will occupy 29 KB. The measured sustained bandwidth of the shared memory bus for writes to
DRAM is 240 MB/s. At this rate, the time to copy the partial sums to DRAM is

$$28{,}912\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 120 \times 10^{-6}\ \text{s} \tag{40}$$


Simultaneous Operation of Three Hammerheads in a Cluster. The three Hammerheads performing
signal processing in a single Hammerhead cluster may add the partial sums simultaneously. However,
because they share the cluster memory bus, they must access DRAM sequentially. Therefore, the
second Hammerhead cannot begin loading its data from DRAM into SRAM until the first Hammerhead has
finished loading its data. Furthermore, because the data access time exceeds the computation time,
virtually all of the computation time can be hidden behind the data access time (see Figure 14).


We can now estimate the total execution time for the second step of beamforming, which will be
dictated by the third Hammerhead:

• wait for the first Hammerhead to load its three sets of input partial sums: 493 × 10^-6 s

• wait for the second Hammerhead to load its three sets of input partial sums: 493 × 10^-6 s

• load the three sets of input partial sums: 493 × 10^-6 s

• add the three partial sums: 376 × 10^-6 s

• store the output partial sum in DRAM: 120 × 10^-6 s

for a total of 3,951 × 10^-6 s for all 278 frequency bins.
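
The step 2 total can be reproduced with the sketch below, which processes the 278 frequency bins in two halves of 139. One caveat, stated as an assumption: the quoted addition time of 376 × 10^-6 s corresponds to about 38.4 Mflop/s rather than a flat 38 Mflop/s, so the sketch assumes the 38 Mflop/s in the text is a rounded value and uses 38.4 Mflop/s to match the report's totals.

```python
DRAM_READ_BW = 176e6    # bytes/s, measured
DRAM_WRITE_BW = 240e6   # bytes/s, measured
MATRIX_RATE = 38.4e6    # flop/s; ASSUMED unrounded form of the quoted 38 Mflop/s

BEAM_BANDS = 26         # one-third of 76, rounded up
BINS_PER_PASS = 139     # half of the 278 frequency bins
BYTES_PER_COMPLEX = 8
PASSES = 2

in_bytes = 3 * BEAM_BANDS * BINS_PER_PASS * BYTES_PER_COMPLEX  # three input sets
out_bytes = BEAM_BANDS * BINS_PER_PASS * BYTES_PER_COMPLEX     # one output set
add_flop = BEAM_BANDS * BINS_PER_PASS * 4                      # two complex adds per bin

load_s = in_bytes / DRAM_READ_BW     # Equation 38
add_s = add_flop / MATRIX_RATE       # Equation 39
store_s = out_bytes / DRAM_WRITE_BW  # Equation 40

# Third Hammerhead: waits for two loads, loads, adds, then stores; two passes.
step2_s = PASSES * (3 * load_s + add_s + store_s)
print(f"{step2_s * 1e6:,.0f} x 10^-6 s")
```

With a flat 38 Mflop/s the same model gives roughly 3,959 × 10^-6 s, a difference that is negligible against the overall time budget.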

6.1.5 Beamforming and Filtering: Inter-Cluster Communication

Prior to the final step of beamforming and filter application, inter-cluster communication is
necessary to exchange beamforming partial sums between clusters. At the end of the beamforming and
filter application stage, we would like to have the beams equally distributed over the three
Hammerhead clusters, with the entire frequency subband in a single Hammerhead cluster. This
arrangement will allow for the remaining stages of the BSAR processing to be done local to a single
Hammerhead.

Before  we  transfer  the  partial  sums  between  the  Hammerhead  clusters,  we  need  to  synchronize  all 
nine  Hammerheads  performing  signal  processing  to  force  the  processors  to  wait  for  the  previous  step  to 
complete. 
Prior to the inter-cluster communication, each cluster has a partial sum for all 76 beam bands and
278 frequency bins (see Figure 15).
Figure 15: Data distribution prior to the inter-cluster communication


After  the  inter-cluster  communication,  each  cluster  will  have  all  three  partial  sums  for  a  third  of  the 
76  beam  bands  (see  Figure  16). 
Figure 16: Data distribution after the inter-cluster communication


This communication operation entails sending, in the worst case, 26 beam bands of partial sums to
one Hammerhead cluster and 26 beam bands of partial sums to the other Hammerhead cluster. This
operation can be performed in two stages: data transfer to the neighboring cluster to the left, and
data transfer to the neighboring cluster to the right (see Figure 17).
Figure 17: Inter-cluster communication to exchange beamforming partial sums


In  both  directions,  the  worst  case  quantity  of  data  to  be  transferred  is  58  KB  for  26  partial  sums  and 
278  frequency  bins. 

On-Chip Memory Usage. Because of the memory limitations of the on-chip SRAM, we will transfer
half of the 278 frequency bins at a time. The two outgoing sets of partial sums will require 58 KB.
The two incoming sets of partial sums will require an additional 58 KB, for a total of 116 KB.

Memory Access: DRAM to SRAM. The two sets of outgoing partial sums occupy 58 KB. The measured
sustained bandwidth of the shared memory bus for reads from DRAM is 176 MB/s. At this rate, the
time to copy the outgoing data from DRAM is

$$57{,}824\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 329 \times 10^{-6}\ \text{s} \tag{41}$$


Inter-Cluster  Communication.  The  measured  sustained  bandwidth  of  the  link  ports  is  32  MB/s.  At 
this  rate,  the  time  to  perform  these  two  transfers  is 


$$57{,}824\ \text{bytes} \div \frac{32 \times 10^{6}\ \text{bytes}}{\text{second}} = 1{,}807 \times 10^{-6}\ \text{s} \tag{42}$$


Memory Access: SRAM to DRAM. The two sets of incoming partial sums occupy 58 KB. The measured
sustained bandwidth of the shared memory bus for writes to DRAM is 240 MB/s. At this rate, the
time to copy both sets of partial sums to DRAM is

$$57{,}824\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 241 \times 10^{-6}\ \text{s} \tag{43}$$

The total time needed to transfer all 278 frequency bins is

$$2 \times (329 + 1{,}807 + 241) \times 10^{-6}\ \text{s} = 4{,}753 \times 10^{-6}\ \text{s} \tag{44}$$
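
The inter-cluster transfer estimate is dominated by the link port bandwidth, as the following sketch of Equation 41 through Equation 44 shows. The variable names are illustrative, and the link-port share printed at the end is a derived figure not stated in the report.

```python
DRAM_READ_BW = 176e6    # bytes/s, measured
DRAM_WRITE_BW = 240e6   # bytes/s, measured
LINK_PORT_BW = 32e6     # bytes/s, measured

BEAM_BANDS = 26         # worst case sent to each neighboring cluster
BINS_PER_PASS = 139     # half of the 278 frequency bins at a time
BYTES_PER_COMPLEX = 8
PASSES = 2

# Two outgoing (and two incoming) sets of partial sums per pass.
transfer_bytes = 2 * BEAM_BANDS * BINS_PER_PASS * BYTES_PER_COMPLEX

read_s = transfer_bytes / DRAM_READ_BW    # Equation 41
link_s = transfer_bytes / LINK_PORT_BW    # Equation 42
write_s = transfer_bytes / DRAM_WRITE_BW  # Equation 43

total_s = PASSES * (read_s + link_s + write_s)  # Equation 44
link_share = link_s / (read_s + link_s + write_s)

print(f"{total_s * 1e6:,.0f} x 10^-6 s, link ports {link_share:.0%} of the time")
```

Summing the rounded terms gives 4,754; the 4,753 in Equation 44 results from carrying unrounded values, as this sketch does.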


6.1.6 Beamforming/Filter Application Step 3


In the third step of beamforming/filtering, we combine the partial sums across the three
Hammerhead clusters. Specifically, each of the nine Hammerheads performing signal processing will
combine the three partial sums for one-ninth of the beam bands to produce the total beamformed sum.

Before we load the three partial sums computed during the first two steps of beamforming/filtering,
we need to synchronize all nine Hammerheads performing signal processing to force the processors to
wait for the previous step to complete.

On-Chip  Memory  Usage.  One-ninth  of  the  76  beam  bands,  rounded  up,  is  nine  beam  bands.  The 
three  sets  of  partial  sums  for  nine  beam  bands  and  278  frequency  bins  will  occupy  60  KB.  The  output  total 
sums  for  nine  beam  bands  and  278  frequency  bins  will  occupy  20  KB.  Together,  these  data  products  will 
require  80  KB. 

Memory Access: DRAM to SRAM. The three sets of partial sums will occupy 60 KB. The measured
sustained bandwidth of the shared memory bus for reads from DRAM is 176 MB/s. At this rate, the
time to copy the partial sums from DRAM is

$$60{,}048\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 341 \times 10^{-6}\ \text{s} \tag{45}$$

Beamforming.  Forming  the  beamformed  output  for  nine  beam  bands  and  278  frequency  bins  will 
require  10,008  flop.  The  measured  sustained  throughput  of  the  Hammerhead  on  matrix  operations  of  this 
size  is  38  Mflop/s.  At  this  rate,  the  time  for  the  third  step  of  beamforming  and  filter  application  is 

$$10{,}008\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 261 \times 10^{-6}\ \text{s} \tag{46}$$


Memory  Access:  SRAM  to  DRAM.  The  beamformed  output  for  278  frequency  bins  will  occupy  20 
KB  of  memory.  The  measured  sustained  bandwidth  of  the  shared  memory  bus  for  writes  to  DRAM  is  240 
MB/s.  At  this  rate,  the  time  to  copy  the  beamformed  output  back  to  DRAM  is 


$$20{,}016\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 83 \times 10^{-6}\ \text{s} \tag{47}$$


Simultaneous Operation of Three Hammerheads in a Cluster. The three Hammerheads performing
signal processing in a single Hammerhead cluster may add the partial sums simultaneously. However,
because they share the cluster memory bus, they must access DRAM sequentially. Therefore, the
second Hammerhead cannot begin loading its data from DRAM into SRAM until the first Hammerhead has
finished loading its data. Furthermore, because the data access time exceeds the computation time,
virtually all of the computation time can be hidden behind the data access time (see Figure 18).


We can now estimate the total execution time for the third step of beamforming, which will be
dictated by the third Hammerhead:

• wait for the first Hammerhead to load its three sets of partial sums: $341 \times 10^{-6}$ s

• wait for the second Hammerhead to load its three sets of partial sums: $341 \times 10^{-6}$ s

• load the three sets of partial sums: $341 \times 10^{-6}$ s

• add the three partial sums: $261 \times 10^{-6}$ s

• store the output total sum: $83 \times 10^{-6}$ s

for a total of $1{,}368 \times 10^{-6}$ s for all 278 frequency bins.

We  can  now  calculate  the  projected  execution  time  for  the  entire  beamforming/filtering  stage: 

• step 1: $19{,}608 \times 10^{-6}$ s

• step 2: $3{,}951 \times 10^{-6}$ s

• inter-cluster communication: $4{,}753 \times 10^{-6}$ s

• step 3: $1{,}368 \times 10^{-6}$ s

for a total of $29{,}680 \times 10^{-6}$ s.
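
The serialized-load timing model used for the third step can be checked with a short script. This is a minimal sketch rather than part of the original analysis: the function name `last_hammerhead_time_us` is an illustrative assumption, and it uses the rounded per-step figures quoted above. Summing rounded figures gives 1,367 µs, within 1 µs of the reported 1,368 µs, which was presumably computed before rounding.

```python
def last_hammerhead_time_us(load_us, compute_us, store_us, n_hammerheads=3):
    """Time for the last Hammerhead in a cluster to finish one step.

    Loads are serialized on the shared memory bus, so the last Hammerhead
    waits for every earlier load, then loads, computes, and stores.
    """
    return n_hammerheads * load_us + compute_us + store_us


# Rounded figures from equations (45)-(47), in microseconds.
step3_us = last_hammerhead_time_us(load_us=341, compute_us=261, store_us=83)

# Stage total from the report's own per-step figures, in microseconds.
stage_total_us = 19_608 + 3_951 + 4_753 + 1_368

print(step3_us)        # 1367
print(stage_total_us)  # 29680
```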

6.1.7  Alias  Correction 

For the alias correction stage, all 44 overlap frequency bins for all beam bands local to a single
Hammerhead can be copied into SRAM. With nine Hammerheads performing computations, in the worst case,
a Hammerhead will have nine beam bands.

On-Chip  Memory  Usage.  The  44  overlap  frequency  bins  for  nine  beam  bands  will  occupy  3  KB. 
The  22  aliased  frequency  bins  for  nine  beam  bands  will  occupy  2  KB. 

Memory  Access:  DRAM  to  SRAM.  The  measured  sustained  bandwidth  of  the  shared  memory  bus 
for  reads  from  DRAM  is  176  MB/s.  At  this  rate,  the  time  to  copy  the  overlap  frequency  bins  from  DRAM 
is 


$$3{,}168\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 18 \times 10^{-6}\ \text{s} \qquad (48)$$


Alias  Correction.  Alias  correction  for  44  overlap  frequency  bins  and  nine  beam  bands  will  require 
396  flop.  The  measured  sustained  throughput  of  the  Hammerhead  on  matrix  operations  is  38  Mflop/s.  At 
this  rate,  the  time  to  perform  alias  correction  is 


$$396\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 10 \times 10^{-6}\ \text{s} \qquad (49)$$


Memory  Access:  SRAM  to  DRAM.  After  alias  correction,  there  are  22  aliased  frequency  bins  for 
each  beam  band.  The  alias  corrected  bins  will  occupy  2  KB.  The  measured  sustained  bandwidth  of  the 
shared  memory  bus  for  writes  to  DRAM  is  240  MB/s.  At  this  rate,  the  time  to  copy  the  aliased  bins  to 
DRAM  is 


$$1{,}584\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 7 \times 10^{-6}\ \text{s} \qquad (50)$$


Simultaneous  Operation  of  Three  Hammerheads  in  a  Cluster.  Because  the  computation  time  for 
alias  correction  is  more  than  one-third  of  the  SRAM  to  DRAM  memory  access  time,  this  memory  access 
cannot  be  hidden  in  the  computation  time,  and  the  execution  time  for  alias  correction  will  be  dictated  by 
the  total  memory  access  time  (see  Figure  19).  The  sum  of  the  memory  access  times  for  three  Hammerheads 
is 


$$3 \times \left[ (18 \times 10^{-6}\ \text{s}) + (7 \times 10^{-6}\ \text{s}) \right] = 74 \times 10^{-6}\ \text{s} \qquad (51)$$
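
The bus-bound estimate can be reproduced from the byte counts and measured bandwidths given above. The sketch below is illustrative only (the helper name `bus_bound_time_s` is an assumption). It also suggests why the result is 74 rather than the 75 obtained from the rounded terms: the unrounded store time is 6.6 µs, not 7 µs.

```python
READ_BW = 176e6   # bytes/s, measured DRAM read bandwidth
WRITE_BW = 240e6  # bytes/s, measured DRAM write bandwidth


def bus_bound_time_s(bytes_in, bytes_out, n_hammerheads=3):
    """Total bus occupancy when every Hammerhead's load and store is serialized."""
    return n_hammerheads * (bytes_in / READ_BW + bytes_out / WRITE_BW)


alias_time_s = bus_bound_time_s(bytes_in=3_168, bytes_out=1_584)
print(f"{alias_time_s * 1e6:.1f} us")  # 73.8 us
```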

6.1.8 Overlap and Save Filtering IFFT

For  the  IFFT  portion  of  overlap  and  save  filtering,  we  will  employ  two  buffers,  as  was  done  with  the 
FFT  portion  of  overlap  and  save  filtering,  to  allow  simultaneous  processing  and  memory  access. 

On-Chip  Memory  Usage.  Two  buffers  each  capable  of  storing  a  complex  256-point  vector  will 
occupy  4  KB.  The  IFFT  weights  vector  will  require  an  additional  1  KB. 

Memory Access: DRAM to SRAM. The measured sustained bandwidth of the shared memory bus
for reads from DRAM is 176 MB/s. At this rate, the time to copy a complex 256-point vector from DRAM
is

$$2{,}048\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 12 \times 10^{-6}\ \text{s} \qquad (52)$$


IFFT Processing. A single 256-point complex FFT will require 10,240 flop. The measured sustained
throughput of the Hammerhead on FFTs is 293 Mflop/s. At this rate, the time to perform one 256-point
complex IFFT is

$$10{,}240\ \text{flop} \div \frac{293 \times 10^{6}\ \text{flop}}{\text{second}} = 35 \times 10^{-6}\ \text{s} \qquad (53)$$


Memory  Access:  SRAM  to  DRAM.  After  the  IFFT,  we  are  only  interested  in  the  201  non-overlap 
time  samples,  which  will  occupy  2  KB.  The  measured  sustained  bandwidth  of  the  shared  memory  bus  for 
writes to DRAM is 240 MB/s. At this rate, the time to copy the complex 201-point vector to DRAM is

$$1{,}608\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 7 \times 10^{-6}\ \text{s} \qquad (54)$$


Simultaneous  Operation  of  Three  Hammerheads  in  a  Cluster.  Because  the  computation  time  for 
the  IFFT  stage  does  not  exceed  the  total  memory  access  time,  the  memory  accesses  cannot  be  completely 
overlapped  with  the  computations  (see  Figure  20). 


Figure 20: Execution timing for overlap and save IFFT (timeline of loads, IFFTs, and stores on buffers 1 and 2 for Hammerheads H1, H2, and H3 against memory bus activity)


The  total  execution  time  will  be  dictated  largely  by  the  time  needed  for  all  the  memory  accesses: 

• three copies for each of eight beam bands from DRAM to SRAM: $279 \times 10^{-6}$ s

• three copies for each of eight beam bands from SRAM to DRAM: $161 \times 10^{-6}$ s

• three copies for the ninth beam band from DRAM to SRAM: $35 \times 10^{-6}$ s

• one 256-point complex FFT: $35 \times 10^{-6}$ s

• one copy for the ninth beam band from SRAM to DRAM: $7 \times 10^{-6}$ s

for a total of $517 \times 10^{-6}$ s.
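
The tally above can be rebuilt from the per-vector byte counts and flop count. The following is a minimal sketch under the same model (all but the last IFFT hidden behind memory traffic). The variable names are illustrative assumptions, and the values are left unrounded until the end.

```python
READ_BW = 176e6    # bytes/s, measured DRAM read bandwidth
WRITE_BW = 240e6   # bytes/s, measured DRAM write bandwidth
FFT_RATE = 293e6   # flop/s, measured FFT throughput

N_HAMMERHEADS = 3
N_BEAM_BANDS = 9   # worst case per Hammerhead

load = 2_048 / READ_BW     # one complex 256-point vector in
store = 1_608 / WRITE_BW   # 201 non-overlap samples out
ifft = 10_240 / FFT_RATE   # one 256-point complex IFFT

total_s = (
    (N_BEAM_BANDS - 1) * N_HAMMERHEADS * (load + store)  # first eight beam bands
    + N_HAMMERHEADS * load                               # loads for the ninth beam band
    + ifft                                               # last IFFT cannot be hidden
    + store                                              # final store
)

print(round(total_s * 1e6))  # 517
```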


6.1.9 Phase Correction

For the phase correction stage, all 201 time samples for all beam bands local to a Hammerhead can
fit in the on-chip SRAM. With nine Hammerheads performing computations, in the worst case, a
Hammerhead will have nine beam bands.

On-Chip  Memory  Usage.  The  201  time  samples  for  nine  beams  will  occupy  14  KB.  Because  phase 
correction  can  be  done  in-place,  we  will  not  need  separate  input  and  output  buffers. 

Memory Access: DRAM to SRAM. The measured sustained bandwidth of the shared memory bus
for reads from DRAM is 176 MB/s. At this rate, the time to copy the 201 time samples for nine beam
bands from DRAM is

$$14{,}472\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 82 \times 10^{-6}\ \text{s} \qquad (55)$$

Phase Correction. Phase correction for 201 time samples and nine beam bands will require 10,854
flop. The measured sustained throughput of the Hammerhead on matrix operations of this size is 38
Mflop/s. At this rate, the time to perform phase correction is

$$10{,}854\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 283 \times 10^{-6}\ \text{s} \qquad (56)$$

Memory  Access:  SRAM  to  DRAM.  The  measured  sustained  bandwidth  of  the  shared  memory  bus 
for  writes  to  DRAM  is  240  MB/s.  At  this  rate,  the  time  to  copy  the  201  phase  corrected  time  samples  for 
nine  beam  bands  to  DRAM  is 


$$14{,}472\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 60 \times 10^{-6}\ \text{s} \qquad (57)$$


Simultaneous Operation of Three Hammerheads in a Cluster. Each of the three Hammerheads
performing signal processing in a single Hammerhead cluster will be performing phase correction
simultaneously. However, because they share the cluster memory bus, they must access DRAM sequentially.
Therefore,  the  second  Hammerhead  cannot  begin  loading  its  data  from  DRAM  into  its  first  buffer  in 
SRAM  until  the  first  Hammerhead  has  finished  loading  its  data  (see  Figure  21). 

As the last processor to complete its processing will be the third Hammerhead, we can now estimate
the total execution time for this stage:

• wait for the first Hammerhead to load its input data: $82 \times 10^{-6}$ s

• wait for the second Hammerhead to load its input data: $82 \times 10^{-6}$ s

• load the input data: $82 \times 10^{-6}$ s

• correct phase: $283 \times 10^{-6}$ s

• store the output data: $60 \times 10^{-6}$ s

for a total of $590 \times 10^{-6}$ s.

6.1.10  Data  Output 

Each  Hammerhead  cluster  has,  in  the  worst  case,  26  beam  bands  worth  of  data.  Therefore,  the  total 
quantity  of  data  to  be  transferred  to  the  output  FPGA  from  a  single  Hammerhead  cluster  is  42  KB.  The 
measured sustained bandwidth of the link port is 32 MB/s. At this rate, the time to copy the data to the
output FPGA is

$$41{,}808\ \text{bytes} \div \frac{32 \times 10^{6}\ \text{bytes}}{\text{second}} = 1{,}307 \times 10^{-6}\ \text{s} \qquad (58)$$
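
The 41,808-byte figure is consistent with 26 beam bands of 201 samples at 8 bytes per sample. The sketch below reproduces it and the link-port transfer time. The 8-byte complex sample size is an assumption inferred from the byte counts in this section, not a value stated in the text.

```python
BYTES_PER_SAMPLE = 8   # assumption: one complex sample stored as two 4-byte floats
LINK_BW = 32e6         # bytes/s, measured sustained link-port bandwidth

beam_bands = 26        # worst case per Hammerhead cluster
samples = 201          # non-overlap time samples per beam band

output_bytes = beam_bands * samples * BYTES_PER_SAMPLE
transfer_s = output_bytes / LINK_BW

print(output_bytes)                  # 41808
print(f"{transfer_s * 1e6:.1f} us")  # 1306.5 us
```

The report rounds this value to 1,307 µs.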

6.1.11 Timing Summary

The estimated execution times for the various stages are summarized below in Table 8.

Table 8: Estimated Execution Time (Original Algorithm Mode)

Stage                     Estimated Execution Time
Data Input                1.9 ms
Overlap and Save: FFT     5.7 ms
Beamforming/Filtering     29.7 ms
Alias Correction          0.1 ms
Overlap and Save: IFFT    0.5 ms
Phase Correction          0.6 ms
Data Output               1.3 ms
Total                     39.8 ms

Those computational stages where the execution time is determined by the memory access times are
those stages where there are relatively few floating-point operations per byte of data accessed. In the
overlap and save FFT stage, where the computation time determined the overall execution time, there were
266,240 flop and 41,664 bytes accessed per FFT, for a ratio of nearly six and a half flop per byte. In the
alias correction stage, where the memory access time determined the overall execution time, there were
396 flop and 4,752 bytes accessed per Hammerhead, for a ratio of less than a tenth of a flop per byte. Given
that the ratio of the peak aggregate computational throughput of the three Hammerheads to the peak
bandwidth of the shared memory bus is at least three flop/s per byte/s and as high as six flop/s per byte/s,
memory access time will continue to drive the execution time.
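
The flop-per-byte comparison can be made concrete with a few lines of arithmetic. This sketch is a simplified, roofline-style check, not the report's method: the helper names are illustrative assumptions, and the balance values are the "three to six flop/s per byte/s" range quoted above.

```python
def flop_per_byte(flop, bytes_accessed):
    """Arithmetic intensity: floating-point operations per byte moved."""
    return flop / bytes_accessed


def is_memory_bound(intensity, machine_balance):
    """Simplified check: a stage is bus-limited when its intensity is
    below the machine's flop/s-per-byte/s balance."""
    return intensity < machine_balance


fft_intensity = flop_per_byte(266_240, 41_664)  # about 6.4 flop per byte
alias_intensity = flop_per_byte(396, 4_752)     # about 0.08 flop per byte

# Balance range quoted in the text: three to six flop/s per byte/s.
for balance in (3.0, 6.0):
    print(balance,
          is_memory_bound(fft_intensity, balance),
          is_memory_bound(alias_intensity, balance))
```

Under this simplification the FFT stage clears even the upper balance figure, while alias correction falls far below the lower one.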

6.2  DEFAULT  ALGORITHM  MODE 

6.2.1  Data  Input 

In  the  default  algorithm  mode,  each  data  block  has  2,048  samples  for  each  channel.  With  as  many  as 
18  channels  of  data  being  brought  into  a  cluster,  the  total  amount  of  data  per  data  block  is  147  KB. 

As was done in the original algorithm mode, the input Hammerhead will accumulate several samples
before writing them all to DRAM. In the default mode, we will establish two buffers, each capable of
storing 1/4th of the sample extent, or 512 samples, for all 18 channels. Each buffer will therefore be 37 KB in
size.

The measured sustained bandwidth of the shared memory bus for writes to DRAM is 240 MB/s. At
this rate, the time to copy 1/4th of the data into DRAM is

$$36{,}864\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 154 \times 10^{-6}\ \text{s} \qquad (59)$$

One such copy to DRAM will occur every 512 samples, or 5.12 × 10^[…] s. The total time spent
copying data into DRAM is $614 \times 10^{-6}$ s per data block.
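
The buffer sizes and copy times for the default-mode data input follow directly from the sample counts. The sketch below reproduces them. The 4-byte sample size is an assumption inferred from 36,864 bytes for 512 samples on 18 channels.

```python
BYTES_PER_SAMPLE = 4    # assumption: inferred from 36,864 / (512 * 18)
CHANNELS = 18           # worst-case channels entering one cluster
BLOCK_SAMPLES = 2_048   # samples per channel per data block
WRITE_BW = 240e6        # bytes/s, measured DRAM write bandwidth

buffer_samples = BLOCK_SAMPLES // 4
block_bytes = BLOCK_SAMPLES * CHANNELS * BYTES_PER_SAMPLE
buffer_bytes = buffer_samples * CHANNELS * BYTES_PER_SAMPLE

copy_s = buffer_bytes / WRITE_BW   # one quarter-block copy
per_block_s = 4 * copy_s           # four copies per data block

print(block_bytes, buffer_bytes)   # 147456 36864
print(round(copy_s * 1e6), round(per_block_s * 1e6))  # 154 614
```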


6.2.2  Overlap  and  Save  Filtering  FFT 


For each channel's worth of data, a 4K-point real FFT must be performed as the first step in the
overlap and save filtering.

On-Chip Memory Usage. The 2,048 samples of new data plus the 2,048 samples of old overlap data
occupy 16 KB. Even though only 16 KB are copied from DRAM to SRAM for a single vector, the input
data vector will occupy twice that space (33 KB) because the vector must be large enough to store the
complex results of the in-place FFT.

To keep the processor busy performing FFTs, we will set up two separate buffers, each 33 KB in
size. While the processor is performing an FFT on the data in one buffer, the other buffer will either be
filled with more data from DRAM or copied back into DRAM. These two buffers together will occupy 66
KB of the on-chip SRAM. We will also need to store in SRAM the FFT weights, which will occupy 16
KB.
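
The SRAM budget for the double-buffered FFT can be tallied as follows. This is a rough sketch under stated assumptions: 4-byte real samples, 8-byte complex values, and exactly 16,384 bytes for the "16 KB" of FFT weights. None of these byte sizes is given explicitly in the text.

```python
BYTES_PER_REAL = 4      # assumption: 32-bit real samples
BYTES_PER_COMPLEX = 8   # assumption: two 32-bit floats per complex value
FFT_POINTS = 4_096      # 2,048 new samples plus 2,048 overlap samples

input_bytes = FFT_POINTS * BYTES_PER_REAL       # copied from DRAM per vector
buffer_bytes = FFT_POINTS * BYTES_PER_COMPLEX   # sized for the in-place complex result
weights_bytes = 16_384                          # assumption: the "16 KB" of FFT weights
sram_bytes = 2 * buffer_bytes + weights_bytes   # two buffers plus weights

print(input_bytes, buffer_bytes, 2 * buffer_bytes, sram_bytes)
# 16384 32768 65536 81920
```

The report quotes these sizes in decimal kilobytes, which is why 32,768 bytes appears as 33 KB and 65,536 bytes as 66 KB.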


Memory Access: DRAM to SRAM. The 16 KB of input data must be copied from DRAM to SRAM
over the shared memory bus. The measured sustained bandwidth of the shared memory bus for reads from
DRAM is 176 MB/s. At this rate, the time to copy the input data from DRAM is

$$16{,}384\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 93 \times 10^{-6}\ \text{s} \qquad (60)$$


FFT Processing. A single 4K-point real FFT requires 122,880 flop. The measured sustained
throughput of the Hammerhead on a real 4K FFT is 307 Mflop/s. At this rate, the time for a single
4K-point real FFT is

$$122{,}880\ \text{flop} \div \frac{307 \times 10^{6}\ \text{flop}}{\text{second}} = 400 \times 10^{-6}\ \text{s} \qquad (61)$$


Memory Access: SRAM to DRAM. After the 4K-point FFT, we are only interested in 140 frequency
bins for one subband, which will occupy 1 KB. The measured sustained bandwidth of the shared memory
bus for writes to DRAM is 240 MB/s. At this rate, the time to copy the frequency-domain data to DRAM is

$$1{,}120\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 5 \times 10^{-6}\ \text{s} \qquad (62)$$


Simultaneous  Operation  of  Three  Hammerheads  in  a  Cluster.  The  restrictions  on  the  use  of  the 
shared  memory  bus  and  the  overlapping  of  computation  and  memory  access  that  existed  for  the  original 
algorithm  mode  also  apply  to  the  default  algorithm  mode.  We  can  estimate  the  total  execution  time  for  this 
stage,  which  will  be  dictated  by  the  third  Hammerhead  in  the  first  cluster: 

• wait for the first Hammerhead to load a 4K-point real vector: $93 \times 10^{-6}$ s

• wait for the second Hammerhead to load a 4K-point real vector: $93 \times 10^{-6}$ s

• load a 4K-point real vector: $93 \times 10^{-6}$ s

• perform 4K-point real FFTs for six channels: $2{,}400 \times 10^{-6}$ s

• store one 140-point complex vector: $5 \times 10^{-6}$ s

for a total of $2{,}684 \times 10^{-6}$ s.
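
Because this stage is compute-bound, only the initial serialized loads and the final store add to the six back-to-back FFTs; the intermediate transfers are hidden by the double buffering. The sketch below sums the rounded figures listed above (400 µs per FFT, that is 2,400 µs for six channels). The variable names are illustrative assumptions.

```python
N_HAMMERHEADS = 3
N_CHANNELS = 6     # worst-case channels per Hammerhead

load_us = 93       # one 4K-point real vector from DRAM
fft_us = 400       # one 4K-point real FFT
store_us = 5       # one 140-point complex vector to DRAM

# The third Hammerhead waits for two loads, performs its own load,
# runs all of its FFTs, then performs the one store that cannot be hidden.
total_us = N_HAMMERHEADS * load_us + N_CHANNELS * fft_us + store_us

print(total_us)  # 2684
```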


6.2.3 Beamforming and Filtering: Step 1


Each Hammerhead forms partial beamforming sums from five or six channels' worth of data, then
redistributes the data prior to completing the beamforming operation. The mapping of the default
algorithm mode onto the DR is the same as the mapping for the original algorithm mode.

On-Chip  Memory  Usage.  The  element-space  data  for  six  channels  and  140  frequency  bins  will 
occupy  7  KB.  The  beamforming/filtering  weights  vectors  for  two  channels  (we  use  one  weights  vector 
while  loading  the  next  weights  vector)  occupy  2  KB.  The  partial  beamforming  sum  vector  will  require  1 
KB.  Together,  these  data  products  will  require  10  KB. 

Memory  Access:  DRAM  to  SRAM.  The  element-space  data  for  six  channels  and  140  frequency  bins 
will  occupy  7  KB.  The  measured  sustained  bandwidth  of  the  shared  memory  bus  for  reads  from  DRAM  is 
176  MB/s.  At  this  rate,  the  time  to  copy  the  element-space  data  from  DRAM  is 


$$6{,}720\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 38 \times 10^{-6}\ \text{s} \qquad (63)$$


The  beamforming/filtering  weights  for  one  channel  will  occupy  1  KB.  The  time  to  copy  the  weights 
from  DRAM  is 


$$1{,}120\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 6 \times 10^{-6}\ \text{s} \qquad (64)$$


Beamforming. Adding the partial beamforming sums from one channel for all 140 frequency bins
will require 1,120 flop. The measured sustained throughput of the Hammerhead on a multiply-accumulate
operation of this size is 122 Mflop/s. At this rate, the time for this multiply-accumulate operation is

$$1{,}120\ \text{flop} \div \frac{122 \times 10^{6}\ \text{flop}}{\text{second}} = 9 \times 10^{-6}\ \text{s} \qquad (65)$$


Memory Access: SRAM to DRAM. The partial beamforming sum vector will occupy 1 KB. The
measured sustained bandwidth of the shared memory bus for writes to DRAM is 240 MB/s. At this rate,
the time to copy the partial sum vector to DRAM is

$$1{,}120\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 5 \times 10^{-6}\ \text{s} \qquad (66)$$


Simultaneous  Operation  of  Three  Hammerheads  in  a  Cluster.  The  restrictions  on  the  use  of  the 
shared  memory  bus  and  the  overlapping  of  computation  and  memory  access  that  existed  for  the  original 
algorithm  mode  also  apply  to  the  default  algorithm  mode.  We  can  estimate  the  total  execution  time  for  this 
stage,  which  will  be  dictated  by  the  third  Hammerhead  in  the  first  cluster: 


• wait for the first Hammerhead to load the element-space data for six channels: $38 \times 10^{-6}$ s

• wait for the second Hammerhead to load the element-space data for six channels: $38 \times 10^{-6}$ s

• load the element-space data for six channels: $38 \times 10^{-6}$ s

• for each beam band:

    - load the beamforming/filtering weights for all six channels for all three Hammerheads:
      $6 \times 10^{-6}\ \text{s} \times 3 \times 6 = 115 \times 10^{-6}$ s

    - perform multiply/accumulate for the sixth channel: $9 \times 10^{-6}$ s

    - store partial beamforming sum in DRAM: $5 \times 10^{-6}$ s

for a total of $1{,}912 \times 10^{-6}$ s for 14 beam bands.
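
The 1,912 µs total is reproduced when the per-step times are kept unrounded until the end; summing the rounded figures above gives a slightly different value. The sketch below is illustrative only, with assumed variable names, and follows the itemization above step by step.

```python
READ_BW = 176e6    # bytes/s, measured DRAM read bandwidth
WRITE_BW = 240e6   # bytes/s, measured DRAM write bandwidth
MAC_RATE = 122e6   # flop/s, measured multiply-accumulate throughput

N_HAMMERHEADS, N_CHANNELS, N_BEAM_BANDS = 3, 6, 14

data_load = 6_720 / READ_BW      # element-space data for six channels
weights_load = 1_120 / READ_BW   # weights for one channel
mac = 1_120 / MAC_RATE           # multiply-accumulate for one channel
store = 1_120 / WRITE_BW         # one partial-sum vector

# Per beam band: all weight loads on the shared bus, then the last
# multiply-accumulate and the store of the partial sum.
per_beam_band = N_HAMMERHEADS * N_CHANNELS * weights_load + mac + store
step1_s = N_HAMMERHEADS * data_load + N_BEAM_BANDS * per_beam_band

print(round(step1_s * 1e6))  # 1912
```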


6.2.4 Beamforming/Filter Application Step 2

In the second step of beamforming/filtering, we accumulate the three partial sums that were
computed within a single Hammerhead cluster. Specifically, each Hammerhead loads from DRAM all three
partial sums for one-third of the beam bands, where each input partial sum is computed from six
channels' worth of element-space data, and combines them to form one set of output partial
sums for one-third of the beam bands.

Before we load the three partial sums computed during the first step of beamforming/filtering, we
need to synchronize the three Hammerheads in each cluster to force the processors to wait for the previous
step to complete.

On-Chip Memory Usage. There is sufficient space in the on-chip SRAM to allow all 140 frequency
bins to be processed in one batch. One-third of 14 beam bands, rounded up, is five beam bands. The three
sets of input partial sums for five beam bands and 140 frequency bins will occupy 17 KB. The output
partial sums for five beam bands and 140 frequency bins will occupy 6 KB. Together, these data products will
require 22 KB.

Memory  Access:  DRAM  to  SRAM.  The  three  sets  of  input  partial  sums  will  occupy  17  KB.  The 
measured  sustained  bandwidth  of  the  shared  memory  bus  for  reads  from  DRAM  is  176  MB/s.  At  this  rate, 
the  time  to  copy  the  input  partial  sums  from  DRAM  is 


$$16{,}800\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 95 \times 10^{-6}\ \text{s} \qquad (67)$$


Beamforming.  Forming  the  cluster-wide  partial  sums  for  five  beam  bands  and  140  frequency  bins 
will  require  2,800  flop.  The  measured  sustained  throughput  of  the  Hammerhead  on  matrix  operations  of 
this  size  is  38  Mflop/s.  At  this  rate,  the  time  for  the  second  step  of  beamforming  and  filter  application  is 


$$2{,}800\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 73 \times 10^{-6}\ \text{s} \qquad (68)$$


Memory Access: SRAM to DRAM. The cluster-wide partial sums for five beam bands and 140
frequency bins will occupy 6 KB. The measured sustained bandwidth of the shared memory bus for writes to
DRAM is 240 MB/s. At this rate, the time to copy the partial sums to DRAM is

$$5{,}600\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 23 \times 10^{-6}\ \text{s} \qquad (69)$$


Simultaneous  Operation  of  Three  Hammerheads  in  a  Cluster.  The  restrictions  on  the  use  of  the 
shared  memory  bus  and  the  overlapping  of  computation  and  memory  access  that  existed  for  the  original 
algorithm  mode  also  apply  to  the  default  algorithm  mode.  We  can  estimate  the  total  execution  time  for  this 
stage,  which  will  be  dictated  by  the  third  Hammerhead  in  the  first  cluster: 

• wait for the first Hammerhead to load its three sets of input partial sums: $95 \times 10^{-6}$ s

• wait for the second Hammerhead to load its three sets of input partial sums: $95 \times 10^{-6}$ s

• load the three sets of input partial sums: $95 \times 10^{-6}$ s

• add the three partial sums: $73 \times 10^{-6}$ s

• store the output partial sum in DRAM: $23 \times 10^{-6}$ s

for a total of $383 \times 10^{-6}$ s for all 140 frequency bins.

6.2.5 Beamforming and Filtering: Inter-Cluster Communication

Prior to the final step of beamforming and filter application, inter-cluster communication is
necessary to exchange beamforming partial sums between clusters. At the end of the beamforming and filter
application stage, we would like to have the beams equally distributed over the three Hammerhead
clusters, with the entire frequency subband in a single Hammerhead cluster. This arrangement will allow for
the remaining stages of the BSAR processing to be done local to a single Hammerhead.

Before  we  transfer  the  partial  sums  between  the  Hammerhead  clusters,  we  need  to  synchronize  all 
nine  Hammerheads  performing  signal  processing  to  force  the  processors  to  wait  for  the  previous  step  to 
complete. 

Prior to the inter-cluster communication, each cluster has a partial sum for all 14 beam bands and
140 frequency bins. After the inter-cluster communication, each cluster will have all three partial sums for
a third of the 14 beam bands. This communication operation entails sending, in the worst case, five beam
bands of partial sums to one Hammerhead cluster and five beam bands of partial sums to the other
Hammerhead cluster. This operation can be performed in two stages: data transfer to the neighboring cluster to
the left, and data transfer to the neighboring cluster to the right. In both directions, the worst case quantity
of data to be transferred is 11 KB for five partial sums and 140 frequency bins.

On-Chip Memory Usage. There is sufficient space in the SRAM to allow all 140 frequency bins to
be transferred at one time. The two outgoing sets of partial sums will require 11 KB. The two incoming
sets of partial sums will require an additional 11 KB, for a total of 22 KB.

Memory Access: DRAM to SRAM. The two sets of outgoing partial sums will occupy 11 KB. The
measured sustained bandwidth of the shared memory bus for reads from DRAM is 176 MB/s. At this rate,
the time to copy the outgoing data from DRAM is

$$11{,}200\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 64 \times 10^{-6}\ \text{s} \qquad (70)$$


Inter-Cluster  Communication.  The  measured  sustained  bandwidth  of  the  link  ports  is  32  MB/s.  At 
this  rate,  the  time  to  perform  these  two  transfers  is 

$$11{,}200\ \text{bytes} \div \frac{32 \times 10^{6}\ \text{bytes}}{\text{second}} = 350 \times 10^{-6}\ \text{s} \qquad (71)$$

Memory Access: SRAM to DRAM. The two sets of incoming partial sums will occupy 11 KB. The
measured sustained bandwidth of the shared memory bus for writes to DRAM is 240 MB/s. At this rate,
the time to copy both sets of partial sums to DRAM is

$$11{,}200\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 47 \times 10^{-6}\ \text{s} \qquad (72)$$

The total time needed to transfer all 140 frequency bins is

$$(64 + 350 + 47) \times 10^{-6}\ \text{s} = 460 \times 10^{-6}\ \text{s} \qquad (73)$$
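
The three legs of the inter-cluster exchange (read from DRAM, link-port transfer, write to DRAM) can be summed directly from the byte count and measured bandwidths. The sketch below, with assumed variable names, keeps the terms unrounded. It suggests why the reported total is 460 even though the rounded terms add up to 461.

```python
READ_BW = 176e6    # bytes/s, measured DRAM read bandwidth
LINK_BW = 32e6     # bytes/s, measured link-port bandwidth
WRITE_BW = 240e6   # bytes/s, measured DRAM write bandwidth

exchange_bytes = 11_200   # two sets of partial sums, worst case

read_s = exchange_bytes / READ_BW    # DRAM to SRAM
link_s = exchange_bytes / LINK_BW    # cluster to cluster
write_s = exchange_bytes / WRITE_BW  # SRAM to DRAM
total_s = read_s + link_s + write_s

print(round(read_s * 1e6), round(link_s * 1e6), round(write_s * 1e6))  # 64 350 47
print(round(total_s * 1e6))  # 460
```

The link-port leg accounts for roughly three-quarters of the total, so the exchange is limited by the 32 MB/s link rather than by the shared memory bus.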


6.2.6  Beamforming/Filter  Application  Step  3 


In the third step of beamforming/filtering, we combine the partial sums across the three Hammerhead
clusters. Specifically, each of the nine Hammerheads (three Hammerheads in each of the three
clusters) performing signal processing will combine the three partial sums for one-ninth of the beam bands to
produce the total beamformed sum.

Before we load the three partial sums computed during the first two steps of beamforming/filtering,
we need to synchronize all nine Hammerheads performing signal processing to force the processors to wait
for the previous step to complete.

On-Chip Memory Usage. One-ninth of the 14 beam bands, rounded up, is two beam bands. The
three sets of partial sums for two beam bands and 140 frequency bins occupy 7 KB. The output total sums
for two beam bands and 140 frequency bins occupy 2 KB. Together, these data products require 9 KB.

Memory Access: DRAM to SRAM. The three sets of partial sums will occupy 7 KB. The measured
sustained bandwidth of the shared memory bus for reads from DRAM is 176 MB/s. At this rate, the time to
copy the partial sums from DRAM is

$$6{,}720\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 38 \times 10^{-6}\ \text{s} \qquad (74)$$


Beamforming. Forming the beamformed output for two beam bands and 140 frequency bins will
require 1,120 flop. The measured sustained throughput of the Hammerhead on matrix operations of this
size is 38 Mflop/s. At this rate, the time for the third step of beamforming and filter application is

$$1{,}120\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 29 \times 10^{-6}\ \text{s} \qquad (75)$$


Memory  Access:  SRAM  to  DRAM.  The  beamformed  output  for  140  frequency  bins  will  occupy  2 
KB  of  memory.  The  measured  sustained  bandwidth  of  the  shared  memory  bus  for  writes  to  DRAM  is  240 
MB/s.  At  this  rate,  the  time  to  copy  the  beamformed  output  back  to  DRAM  is 


$$2{,}240\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 9 \times 10^{-6}\ \text{s} \qquad (76)$$


Simultaneous Operation of Three Hammerheads in a Cluster. The restrictions on the use of the
shared memory bus and the overlapping of computation and memory access that existed for the original
algorithm mode also apply to the default algorithm mode. We can estimate the total execution time for this
stage, which will be dictated by the third Hammerhead in the first cluster:

• wait for the first Hammerhead to load its three sets of partial sums: $38 \times 10^{-6}$ s

• wait for the second Hammerhead to load its three sets of partial sums: $38 \times 10^{-6}$ s

• load the three sets of partial sums: $38 \times 10^{-6}$ s

• add the three partial sums: $29 \times 10^{-6}$ s

• store the output total sum: $9 \times 10^{-6}$ s

for a total of $153 \times 10^{-6}$ s for all 140 frequency bins.

We can now calculate the projected execution time for the entire beamforming/filtering stage:

• step 1: $1{,}912 \times 10^{-6}$ s

• step 2: $383 \times 10^{-6}$ s

• inter-cluster communication: $460 \times 10^{-6}$ s

• step 3: $153 \times 10^{-6}$ s

for a total of $2{,}908 \times 10^{-6}$ s.
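
For comparison, the beamforming/filtering totals of the two algorithm modes can be set side by side. The sketch below uses only the per-step figures reported in Sections 6.1 and 6.2; the dictionary layout and the derived ratio are illustrative additions, not results stated in the report.

```python
# Per-step beamforming/filtering times in microseconds, as reported.
original_us = {"step 1": 19_608, "step 2": 3_951, "inter-cluster": 4_753, "step 3": 1_368}
default_us = {"step 1": 1_912, "step 2": 383, "inter-cluster": 460, "step 3": 153}

original_total = sum(original_us.values())
default_total = sum(default_us.values())
ratio = original_total / default_total

print(original_total, default_total)  # 29680 2908
print(f"{ratio:.1f}x")                # 10.2x

# Share of each step in the default mode; step 1 dominates.
for step, t in default_us.items():
    print(f"{step}: {100 * t / default_total:.0f}%")
```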


6.2.7  Alias  Correction 

For the alias correction stage, all 24 overlap frequency bins for all beam bands local to a single
Hammerhead can be copied into SRAM. With nine Hammerheads performing computations, in the worst case,
a Hammerhead will have two beam bands.

On-Chip Memory Usage. The 24 overlap frequency bins for two beam bands will occupy 384 bytes.
The 12 aliased frequency bins for two beam bands will occupy 192 bytes.

Memory  Access:  DRAM  to  SRAM.  The  measured  sustained  bandwidth  of  the  shared  memory  bus 
for  reads  from  DRAM  is  176  MB/s.  At  this  rate,  the  time  to  copy  the  overlap  frequency  bins  from  DRAM 
is 


$$384\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 2 \times 10^{-6}\ \text{s} \qquad (77)$$


Alias  Correction.  Alias  correction  for  24  overlap  frequency  bins  and  two  beam  bands  will  require  48 
flop.  The  measured  sustained  throughput  of  the  Hammerhead  on  matrix  operations  is  38  Mflop/s.  At  this 
rate,  the  time  to  perform  alias  correction  is 

$$48\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 1 \times 10^{-6}\ \text{s} \qquad (78)$$

Memory Access: SRAM to DRAM. After alias correction, there are 12 aliased frequency bins for
each beam band. The alias corrected bins will occupy 192 bytes. The measured sustained bandwidth of the
shared memory bus for writes to DRAM is 240 MB/s. At this rate, the time to copy the aliased bins to
DRAM is

$$192\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 1 \times 10^{-6}\ \text{s} \qquad (79)$$


Simultaneous  Operation  of  Three  Hammerheads  in  a  Cluster.  Because  the  computation  time  for 
alias  correction  is  more  than  one-third  of  the  SRAM  to  DRAM  memory  access  time,  this  memory  access 
cannot  be  hidden  in  the  computation  time,  and  the  execution  time  for  alias  correction  will  be  dictated  by 
the  total  memory  access  time.  The  sum  of  the  memory  access  times  for  three  Hammerheads  is 

$$3 \times \left[ (2 \times 10^{-6}\ \text{s}) + (1 \times 10^{-6}\ \text{s}) \right] = 9 \times 10^{-6}\ \text{s} \qquad (80)$$


6.2.8  Overlap  and  Save  Filtering  IFFT 

For  the  IFFT  portion  of  overlap  and  save  filtering,  we  will  employ  two  buffers,  as  was  done  with  the 
FFT  portion  of  overlap  and  save  filtering,  to  allow  simultaneous  processing  and  memory  access. 

On-Chip  Memory  Usage.  Two  buffers  each  capable  of  storing  a  complex  128-point  vector  occupy  2 
KB.  The  IFFT  weights  vector  will  require  an  additional  512  bytes. 

Memory  Access:  DRAM  to  SRAM.  The  measured  sustained  bandwidth  of  the  shared  memory  bus 
for  reads  from  DRAM  is  176  MB/s.  At  this  rate,  the  time  to  copy  a  complex  128-point  vector  from  DRAM 
is 


$$1{,}024\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 6 \times 10^{-6}\ \text{s} \qquad (81)$$


IFFT  Processing.  A  single  128-point  complex  FFT  will  require  4,480  flop.  The  measured  sustained 
throughput  of  the  Hammerhead  on  FFTs  is  250  Mflop/s.  At  this  rate,  the  time  to  perform  one  128-point 
complex  IFFT  is 


$$4{,}480\ \text{flop} \div \frac{250 \times 10^{6}\ \text{flop}}{\text{second}} = 18 \times 10^{-6}\ \text{s} \qquad (82)$$


Memory  Access:  SRAM  to  DRAM.  After  the  IFFT,  we  are  only  interested  in  the  64  non-overlap  time 
samples,  which  will  occupy  512  bytes.  The  measured  sustained  bandwidth  of  the  shared  memory  bus  for 
writes  to  DRAM  is  240  MB/s.  At  this  rate,  the  time  to  copy  the  complex  64-point  vector  to  DRAM  is 


$$512\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 2 \times 10^{-6}\ \text{s} \qquad (83)$$
Simultaneous Operation of Three Hammerheads in a Cluster. The restrictions on the use of the
shared  memory  bus  and  the  overlapping  of  computation  and  memory  access  that  existed  for  the  original 
algorithm  mode  also  apply  to  the  default  algorithm  mode.  We  can  estimate  the  total  execution  time  for  this 
stage,  which  will  be  dictated  by  the  third  Hammerhead  in  the  first  cluster: 

•  three copies for the first beam band from DRAM to SRAM: $17 \times 10^{-6}$ s

•  three copies for the first beam band from SRAM to DRAM: $6 \times 10^{-6}$ s

•  three copies for the second beam band from DRAM to SRAM: $17 \times 10^{-6}$ s

•  one 128-point complex FFT: $18 \times 10^{-6}$ s

•  one copy for the second beam band from SRAM to DRAM: $2 \times 10^{-6}$ s

for a total of $61 \times 10^{-6}$ s.
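
The listed terms, each rounded to the nearest microsecond, add up to 60 microseconds; the stated total of 61 microseconds is reproduced when the unrounded times are summed first. The Python sketch below recomputes the list from the bandwidth and throughput figures quoted in this section, on the assumption that the total was formed that way.

```python
READ_BW = 176e6    # measured DRAM read bandwidth (bytes/s), from the text
WRITE_BW = 240e6   # measured DRAM write bandwidth (bytes/s), from the text
FFT_RATE = 250e6   # measured FFT throughput (flop/s), from the text

read_128 = 1024 / READ_BW     # load one complex 128-point vector
write_64 = 512 / WRITE_BW     # store the 64 non-overlap samples
ifft_128 = 4480 / FFT_RATE    # one 128-point complex transform

# Terms in the same order as the list above (third Hammerhead in the cluster).
terms_s = [
    3 * read_128,   # three copies, first beam band, DRAM to SRAM
    3 * write_64,   # three copies, first beam band, SRAM to DRAM
    3 * read_128,   # three copies, second beam band, DRAM to SRAM
    ifft_128,       # one 128-point transform
    write_64,       # one copy, second beam band, SRAM to DRAM
]

total_us = sum(terms_s) * 1e6
rounded_terms_us = [round(t * 1e6) for t in terms_s]

print(rounded_terms_us)   # [17, 6, 17, 18, 2]
print(round(total_us))    # 61
```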

6.2.9  Phase  Correction 

For the phase correction stage, all 64 time samples for all beam bands local to a Hammerhead can fit
in the on-chip SRAM. With nine Hammerheads performing computations, in the worst case, a Hammerhead
will have two beam bands.

On-Chip Memory Usage. The 64 time samples for two beams will occupy 1 KB. Because phase correction
can be done in-place, we will not need separate input and output buffers.

Memory Access: DRAM to SRAM. The measured sustained bandwidth of the shared memory bus
for reads from DRAM is 176 MB/s. At this rate, the time to copy the 64 time samples for two beam bands
(1,024 bytes) from DRAM is


$$1{,}024\ \text{bytes} \div \frac{176 \times 10^{6}\ \text{bytes}}{\text{second}} = 6 \times 10^{-6}\ \text{s} \qquad (84)$$


Phase  Correction.  Phase  correction  for  64  time  samples  and  two  beam  bands  will  require  768  flop. 
The  measured  sustained  throughput  of  the  Hammerhead  on  matrix  operations  of  this  size  is  38  Mflop/s.  At 
this  rate,  the  time  to  perform  phase  correction  is 


$$768\ \text{flop} \div \frac{38 \times 10^{6}\ \text{flop}}{\text{second}} = 20 \times 10^{-6}\ \text{s} \qquad (85)$$
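
The 768 flop workload equals 6 flop for each of the 64 x 2 samples, which matches the cost of one complex multiplication per sample (four real multiplies and two real adds). Assuming phase correction is such a per-sample multiplication by a unit-magnitude phasor, an in-place version might look like the Python sketch below. The function name and the phase values are hypothetical and are not taken from the report.

```python
import cmath

FLOP_PER_COMPLEX_MULTIPLY = 6   # 4 real multiplies + 2 real adds


def correct_phase_in_place(samples: list, phases_rad: list) -> int:
    """Rotate each complex sample by its phase, overwriting the input buffer.

    Returns the number of floating-point operations spent on the multiplies
    (the cost of forming the phasors themselves is not counted).
    """
    if len(samples) != len(phases_rad):
        raise ValueError("need one phase per sample")
    for i, phi in enumerate(phases_rad):
        samples[i] = samples[i] * cmath.exp(1j * phi)
    return FLOP_PER_COMPLEX_MULTIPLY * len(samples)


if __name__ == "__main__":
    n_samples, n_beam_bands = 64, 2
    buffer = [complex(1.0, 0.0)] * (n_samples * n_beam_bands)   # placeholder data
    phases = [0.01 * k for k in range(len(buffer))]             # hypothetical phases
    flops = correct_phase_in_place(buffer, phases)
    print(flops)                          # 768
    print(round(flops / 38e6 * 1e6))      # 20 (microseconds at 38 Mflop/s)
```

Because the result overwrites the input buffer, only one 1 KB buffer is needed, in line with the on-chip memory usage given for this stage.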


Memory Access: SRAM to DRAM. The measured sustained bandwidth of the shared memory bus
for writes to DRAM is 240 MB/s. At this rate, the time to copy the 64 phase-corrected time samples for
two beam bands (1,024 bytes) to DRAM is


$$1{,}024\ \text{bytes} \div \frac{240 \times 10^{6}\ \text{bytes}}{\text{second}} = 4 \times 10^{-6}\ \text{s} \qquad (86)$$


Simultaneous Operation of Three Hammerheads in a Cluster. The restrictions on the use of the
shared  memory  bus  and  the  overlapping  of  computation  and  memory  access  that  existed  for  the  original 
algorithm  mode  also  apply  to  the  default  algorithm  mode.  We  can  estimate  the  total  execution  time  for  this 
stage,  which  will  be  dictated  by  the  third  Hammerhead  in  the  first  cluster: 

•  wait for the first Hammerhead to load its input data: $6 \times 10^{-6}$ s

•  wait for the second Hammerhead to load its input data: $6 \times 10^{-6}$ s

•  load the input data: $6 \times 10^{-6}$ s

•  correct phase: $20 \times 10^{-6}$ s

•  store the output data: $4 \times 10^{-6}$ s

for a total of $42 \times 10^{-6}$ s.
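
The list above amounts to a simple timeline in which the three Hammerheads in a cluster load their input one after another over the shared memory bus, then compute and store. A minimal Python sketch of that model follows, assuming (as the list implies) that only the loads of earlier Hammerheads delay the third and that stores do not collide with other bus traffic.

```python
def finish_times_us(load_us: float, compute_us: float, store_us: float,
                    n_processors: int = 3) -> list:
    """Completion time of each processor when loads are serialized on one bus.

    Processor k (0-based) waits for k earlier loads, then loads, computes,
    and stores. Store contention is ignored (an assumption of this sketch).
    """
    return [
        k * load_us + load_us + compute_us + store_us
        for k in range(n_processors)
    ]


if __name__ == "__main__":
    # Rounded per-step times from equations (84) to (86), in microseconds.
    times = finish_times_us(load_us=6, compute_us=20, store_us=4)
    print(times)        # [30, 36, 42]
    print(max(times))   # 42, the stage time set by the third Hammerhead
```

With these inputs the stores fall in separate intervals (26 to 30, 32 to 36, and 38 to 42 microseconds), so ignoring store contention does not change the result here.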

6.2.10  Data  Output 

Each Hammerhead cluster has, in the worst case, five beam bands' worth of data. Therefore, the total
quantity of data to be transferred to the output FPGA from a single Hammerhead cluster is 2,560 bytes
(approximately 3 KB). The measured sustained bandwidth of the link port is 32 MB/s. At this rate, the
time to copy the data to the output FPGA is


$$2{,}560\ \text{bytes} \div \frac{32 \times 10^{6}\ \text{bytes}}{\text{second}} = 80 \times 10^{-6}\ \text{s} \qquad (87)$$


6.2.11 Timing Summary

The estimated execution times for the various stages are summarized below in Table 9.

Table 9: Estimated Execution Time (Default Algorithm Mode)

Stage                     Estimated Execution Time
Data Input                0.6 ms
Overlap and Save: FFT     2.7 ms
Beamforming/Filtering     2.9 ms
Alias Correction          0.0 ms
Overlap and Save: IFFT    0.1 ms
Phase Correction          0.0 ms
Data Output               0.1 ms
Total                     6.4 ms
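
The last four entries follow from the stage totals derived in Sections 6.2.8 through 6.2.10 and equation (80), rounded to the nearest 0.1 ms. The first three are copied from the table itself, since their derivations are not repeated here. A short Python check of the rounding and of the total:

```python
# Stage times in milliseconds. The first three are copied from the table;
# the rest are the microsecond totals from this section converted to ms.
stage_ms = {
    "Data Input": 0.6,
    "Overlap and Save: FFT": 2.7,
    "Beamforming/Filtering": 2.9,
    "Alias Correction": 9e-3,         # 9 us
    "Overlap and Save: IFFT": 61e-3,  # 61 us
    "Phase Correction": 42e-3,        # 42 us
    "Data Output": 80e-3,             # 80 us
}

rounded_ms = {stage: round(t, 1) for stage, t in stage_ms.items()}
total_ms = round(sum(rounded_ms.values()), 1)

for stage, t in rounded_ms.items():
    print(f"{stage:<24}{t:.1f} ms")
print(f"{'Total':<24}{total_ms:.1f} ms")   # 6.4 ms
```

The two 0.0 ms entries are stages of 9 and 42 microseconds that round below the table's 0.1 ms resolution.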

7.  SUMMARY  AND  CONCLUSIONS


In this report, we have described both the BSAR DR frequency-domain algorithm and the 12-Hammerhead
DR architecture. Given these two components, we have proposed a mapping of the algorithm
onto the architecture, and have estimated the execution time for both the original algorithm mode and the
new default mode using measurements of the computation and communication performance.

Given the proposed mapping and these efficiency estimates, we project that execution of the DR
algorithm in the original mode will take approximately 39.8 ms, which is within the 42.9 ms available
after allowing for the required 50% processor spare (the 39.8 ms estimated execution time leaves 62%
spare processor capacity). The algorithm in the default mode will take approximately 6.4 ms, which is
within the 13.7 ms available (the 6.4 ms estimated execution time leaves 220% spare processor capacity).
A key observation from this mapping analysis is that the memory access time will heavily dominate the
overall execution time and, given the likely contention issues, will therefore be a risk area. Alternate
mappings that make greater use of the link port connections between Hammerheads to alleviate some of the
load on the shared memory bus were analyzed in the original version of this report and were found to have
worse performance.
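
The spare-capacity percentages can be reproduced if the "available" time is read as the full processing interval divided by 1.5, that is, with the 50% spare requirement already applied. This reading is an inference from the numbers rather than something the text states. A minimal Python sketch under that assumption:

```python
REQUIRED_SPARE = 0.50   # 50% processor spare requirement from the text


def spare_capacity(available_ms: float, estimated_ms: float,
                   required_spare: float = REQUIRED_SPARE) -> float:
    """Spare capacity as a fraction of the estimated execution time.

    Assumes available_ms already has the required spare removed, so the
    full interval is available_ms * (1 + required_spare).
    """
    full_interval_ms = available_ms * (1.0 + required_spare)
    return full_interval_ms / estimated_ms - 1.0


if __name__ == "__main__":
    print(round(spare_capacity(42.9, 39.8) * 100))   # 62 (original mode)
    print(round(spare_capacity(13.7, 6.4) * 100))    # 221 (default mode; text reports 220)
```

The one-point difference in the default mode is plausibly due to rounding of the published 13.7 ms and 6.4 ms inputs.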

Our analysis shows that the BSAR DR will be able to perform its signal processing within the available
time, although it is possible that some of the processor margin will be consumed if shared memory bus
contention is worse than we anticipate. Also, we expect that the DR will have sufficient SRAM and
DRAM to store input, intermediate, and output data, although, with the original algorithm parameters,
some stages will consume significant portions of the memory margin.
LIST  OF  ACRONYMS


ADC       Analog to Digital Converter
ADCAP     ADvanced CAPability
AGC       Automatic Gain Control
ALU       Arithmetic and Logic Unit
BES       Broadband Evaluation System
BSAR      Broadband Sonar Analog Receiver
CBASS     Common Broadband Advanced Sonar System
DMA       Direct Memory Access
DR        Digital Receiver
DRAM      Dynamic Random-Access Memory
DSP       Digital Signal Processor
ENOBs     Effective Number Of Bits
FFT       Fast Fourier Transform
FIR       Finite Impulse Response
flop      FLoating-point OPeration, a measure of processing workload
flop/s    FLoating-point OPerations per second, a measure of processing throughput
FPGA      Field-Programmable Gate Array
GCB       Guidance and Control Box
IFFT      Inverse Fast Fourier Transform
KB        Kilo Byte, or 1,000 bytes
MB        Mega Byte, or 1,000,000 bytes
NUWC      the Naval Undersea Warfare Center
PROM      Programmable Read-Only Memory
SIMD      Single Instruction stream, Multiple Data streams
SDRAM     Synchronous Dynamic Random-Access Memory
SNR       Signal to Noise Ratio
SRAM      Static Random-Access Memory